Re: Neural Network in hadoop

2015-02-12 Thread Ted Dunning
That is a really old paper that basically pre-dates all of the recent important work in neural networks. You should look at work on Rectified Linear Units (ReLU), drop-out regularization, parameter servers (downpour SGD) and deep learning. Map-reduce as you have used it will not produce

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

2015-01-14 Thread Ted Dunning
Have you considered implementing this using something like Spark? That could be much easier than raw map-reduce On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni unmeshab...@gmail.com wrote: In KNN like algorithm we need to load model Data into cache for predicting the records. Here is the

Re: issues with decrease the default.block.size

2013-05-12 Thread Ted Dunning
The block size controls lots of things in Hadoop. It affects read parallelism, scalability, block allocation and other aspects of operations either directly or indirectly. On Sun, May 12, 2013 at 10:38 AM, shashwat shriparv dwivedishash...@gmail.com wrote: The block size is for allocation

Re: Wrapping around BitSet with the Writable interface

2013-05-12 Thread Ted Dunning
Another interesting alternative is the EWAH implementation of java bitsets that allow efficient compressed bitsets with very fast OR operations. https://github.com/lemire/javaewah See also https://code.google.com/p/sparsebitmap/ by the same authors. On Sun, May 12, 2013 at 1:11 PM, Bertrand
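
The appeal of compressed bitsets with fast OR can be sketched in a few lines. This is not the EWAH word-aligned hybrid scheme itself, just an illustrative run-length-encoded bitmap (the names `rle` and `rle_or` are made up for this sketch):

```python
def rle(bits):
    """Compress an iterable of sorted set-bit positions into (start, length) runs."""
    runs = []
    for b in bits:
        if runs and b == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)   # extend the current run
        elif runs and b < runs[-1][0] + runs[-1][1]:
            continue                                     # duplicate bit, already covered
        else:
            runs.append((b, 1))                          # start a new run
    return runs

def rle_or(a, b):
    """OR two run-length-encoded bitmaps by merging overlapping runs,
    without ever expanding to one entry per bit."""
    out = []
    for start, length in sorted(a + b):
        if out and start <= out[-1][0] + out[-1][1]:
            end = max(out[-1][0] + out[-1][1], start + length)
            out[-1] = (out[-1][0], end - out[-1][0])
        else:
            out.append((start, length))
    return out
```

The OR walks each operand's runs once, which is why compressed bitmaps can OR faster than uncompressed ones when the data is sparse or clustered.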

Re: What's the best disk configuration for hadoop? SSD's Raid levels, etc?

2013-05-11 Thread Ted Dunning
This sounds (with no real evidence) like you are a bit light on memory for that number of cores. That could cause you to be spilling map outputs early and very much slowing things down. On Fri, May 10, 2013 at 11:30 PM, David Parks davidpark...@yahoo.comwrote: We’ve got a cluster of 10x

Re: MapReduce - FileInputFormat and Locality

2013-05-08 Thread Ted Dunning
I think that you just said what the OP said. Your two cases reduce to the same single case that they had. Whether this matters is another question, but it seems like it could in cases where splits != blocks, especially if a split starts near the end of a block which could give an illusion of

Re: Hardware Selection for Hadoop

2013-04-29 Thread Ted Dunning
I think that having more than 6 drives is better. More memory never hurts. If you have too little, you may have to run with fewer slots than optimal. 10GB networking is good. If not, having more than 2 1GBe ports is good, at least on distributions that can deal with them properly. On Mon,

Re: Cartesian product in hadoop

2013-04-18 Thread Ted Dunning
It is rarely practical to do exhaustive comparisons on datasets of this size. The method used is to heuristically prune the cartesian product set and only examine pairs that have a high likelihood of being near. This can be done in many ways. Your suggestion of doing a map-side join is a
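
One common way to "heuristically prune the cartesian product" is grid blocking: bucket points into cells the size of the search radius and compare only points in the same or adjacent cells. A minimal 2-D sketch (names and the specific bucketing are illustrative, not from the thread):

```python
from collections import defaultdict
from itertools import combinations, product

def near_pairs(points, radius):
    """Find pairs within `radius` without the full cartesian product:
    bucket points into grid cells of side `radius`, then compare only
    points in the same or neighbouring cells."""
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // radius), int(p[1] // radius))].append(p)
    seen, pairs = set(), []
    for (cx, cy), _ in cells.items():
        # candidates come from this cell and its 8 neighbours
        cand = []
        for dx, dy in product((-1, 0, 1), repeat=2):
            cand.extend(cells.get((cx + dx, cy + dy), []))
        for a, b in combinations(sorted(set(cand)), 2):
            if (a, b) in seen:
                continue
            if (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 <= radius ** 2:
                seen.add((a, b))
                pairs.append((a, b))
    return pairs
```

In a map-reduce setting the cell id would be the shuffle key, so each reducer sees only one neighbourhood's candidates.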

Re: Physically moving HDFS cluster to new

2013-04-17 Thread Ted Dunning
It may or may not help you in your current distress, but MapR's distribution could handle this pretty easily. One method is direct distcp between clusters, but you could also use MapR's mirroring capabilities to migrate data. You can also carry a MapR cluster, change the IP addresses and relight

Re: Copy Vs DistCP

2013-04-14 Thread Ted Dunning
matter because once it gets going, it moves data much faster. On Apr 14, 2013 6:15 AM, Ted Dunning tdunn...@maprtech.com wrote: Lance, Never say never. Linux programs can read from the right kind of Hadoop cluster without using FUSE. On Fri, Apr 12, 2013 at 10:15 AM, Lance

Re: Copy Vs DistCP

2013-04-14 Thread Ted Dunning
On Sun, Apr 14, 2013 at 10:33 AM, Mathias Herberts mathias.herbe...@gmail.com wrote: This is absolutely true. Distcp dominates cp for large copies. On the other hand cp dominates distcp for convenience. In my own experience, I love cp when copying relatively small amounts of data

Re: Copy Vs DistCP

2013-04-13 Thread Ted Dunning
Lance, Never say never. Linux programs can read from the right kind of Hadoop cluster without using FUSE. On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog goks...@gmail.com wrote: Shell 'cp' only works if you use 'fuse', which makes the HDFS file system visible as a Unix mounted file

Re: Bloom Filter analogy in SQL

2013-03-30 Thread Ted Dunning
This isn't a very Hadoop question. A Bloom filter is a very low level data structure that doesn't really have any correlate in SQL. It allows you to find duplicates quickly and probabilistically. In return for a small probability of a false positive, it uses less memory. On Fri, Mar 29, 2013 at
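
The trade-off described (small false-positive probability, never a false negative, much less memory) is easy to see in a toy implementation. A minimal sketch, with an arbitrary choice of SHA-256 as the hash family:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array.
    Membership tests may yield false positives, never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # derive k independent positions by salting one hash function
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "probably present"; False is definitive
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

Storage is m bits regardless of item size, which is where the memory saving over an exact set comes from.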

Re: Million docs and word count scenario

2013-03-29 Thread Ted Dunning
Putting each document into a separate file is not likely to be a great thing to do. On the other hand, putting them all into one file may not be what you want either. It is probably best to find a middle ground and create files each with many documents and each a few gigabytes in size. On Fri,

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-28 Thread Ted Dunning
The EMR distributions have special versions of the s3 file system. They might be helpful here. Of course, you likely aren't running those if you are seeing 5MB/s. An extreme alternative would be to light up an EMR cluster, copy to it, then to S3. On Thu, Mar 28, 2013 at 4:54 AM, Himanish

Re: Static class vs Normal Class when to use

2013-03-28 Thread Ted Dunning
Another Ted piping in. For Hadoop use, it is dangerous to use anything but a static class for your mapper and reducer functions since you may accidentally think that you can access a closed variable from the parent. A static class cannot reference those values so you know that you haven't made

Re: Which hadoop installation should I use on ubuntu server?

2013-03-28 Thread Ted Dunning
Also, Canonical just announced that MapR is available in the Partner repos. On Thu, Mar 28, 2013 at 7:22 AM, Nitin Pawar nitinpawar...@gmail.comwrote: apache bigtop has builds done for ubuntu you can check them at jenkins mentioned on bigtop.apache.org On Thu, Mar 28, 2013 at 11:37 AM,

Re: Naïve k-means using hadoop

2013-03-27 Thread Ted Dunning
And, of course, due credit should be given here. The advanced clustering algorithms in Crunch were lifted from the new stuff in Mahout pretty much step for step. The Mahout group would have loved to have contributions from the Cloudera guys instead of re-implementation, but you can't legislate

Re: Naïve k-means using hadoop

2013-03-27 Thread Ted Dunning
Spark would be an excellent choice for the iterative sort of k-means. It could be good for sketch-based algorithms as well, but the difference would be much less pronounced. On Wed, Mar 27, 2013 at 3:39 PM, Charles Earl charlesce...@me.com wrote: I would think also that starting with centers

Re: copytolocal vs distcp

2013-03-09 Thread Ted Dunning
Try file:///fs4/outdir Symbolic links can also help. Note that this file system has to be visible with the same path on all hosts. You may also be bandwidth limited by whatever is serving that file system. There are cases where you won't be limited by the file system. MapR, for instance, has

Re: Accumulo and Mapreduce

2013-03-04 Thread Ted Dunning
Chaining the jobs is a fantastically inefficient solution. If you use Pig or Cascading, the optimizer will glue all of your map functions into a single mapper. The result is something like: (mapper1 → mapper2 → mapper3) → reducer Here the parentheses indicate that all of the map functions

Re: mapr videos question

2013-02-24 Thread Ted Dunning
The MapR videos on programming and map-reduce are all general videos. The videos that cover capabilities like NFS, snapshots and mirrors are all MapR specific since ordinary Hadoop distributions like Cloudera, Hortonworks and Apache can't support those capabilities. The videos that cover MapR

Re: product recommendations engine

2013-02-17 Thread Ted Dunning
Yeah... you can make this work. First, if your setup is relatively small, then you won't need Hadoop. Second, having lots of kinds of actions is a very reasonable thing to have. My own suggestion is that you analyze these each for their predictive power independently and then combine them at

Re: Correlation between replication factor and read/write performance survey?

2013-02-11 Thread Ted Dunning
The delay due to replication is rarely a large problem in traditional map-reduce programs since many writes are occurring at once. The real problem comes because you are consuming 3x the total disk bandwidth so that the theoretical maximum equilibrium write bandwidth is limited to the lesser of
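
The excerpt cuts off mid-sentence, but the stated premise (3x replication consumes 3x the disk bandwidth) supports a back-of-envelope ceiling. The formula below is my hedged completion, not a quote from the thread: each logical byte costs `replication` disk writes and `replication - 1` network hops through the write pipeline.

```python
def max_write_bandwidth(disk_bw_per_node, net_bw_per_node, nodes, replication=3):
    """Back-of-envelope ceiling on aggregate write bandwidth (MB/s).
    Every logical byte written costs `replication` disk writes and
    `replication - 1` trips over the network (pipeline replication),
    so throughput is capped by whichever resource saturates first."""
    total_disk = disk_bw_per_node * nodes
    total_net = net_bw_per_node * nodes
    return min(total_disk / replication, total_net / (replication - 1))
```

For example, ten nodes with 600 MB/s of disk and 125 MB/s of network each are network-bound: 1250 / 2 = 625 MB/s, well under the 2000 MB/s disk ceiling.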

Re: Mutiple dfs.data.dir vs RAID0

2013-02-10 Thread Ted Dunning
wrote: We have seen in several of our Hadoop clusters that LVM degrades performance of our M/R jobs, and I remembered a message where Ted Dunning was explaining something about this, and since that time, we don't use LVM for Hadoop data directories. About RAID volumes, the best performance

Re: Question related to Decompressor interface

2013-02-10 Thread Ted Dunning
All of these suggestions tend to founder on the problem of key management. What you need to do is 1) define your threats. 2) define your architecture including key management. 3) demonstrate how the architecture defends against the threat environment. I haven't seen more than a cursory

Re: Hi,can u please help me how to retrieve the videos from hdfs

2013-02-02 Thread Ted Dunning
Works with a real-time version of Hadoop such as MapR. But you are right that HDFS and MapReduce were never intended for real-time use. On Fri, Feb 1, 2013 at 1:40 AM, Mohammad Tariq donta...@gmail.com wrote: How are going to store videos in HDFS? By 'playing video on the browser' I assume

Re: Dell Hardware

2013-01-31 Thread Ted Dunning
We have tested both machines in our labs at MapR and both work well. Both run pretty hot so you need to keep a good eye on that. The R720 will have higher wattage per unit of storage due to the smaller number of drives per chassis. That may be a good match for ordinary Hadoop due to the lower

Re: Suggestions for Change Management System for Hadoop projects

2013-01-27 Thread Ted Dunning
Are you asking about change management for configurations and such? If so, there are good tools out there for managing that including puppet, chef and ansible. Or are you asking about something else? Both Cloudera and MapR have tools that help with centralized configuration management of

Re: How to Backup HDFS data ?

2013-01-24 Thread Ted Dunning
Incremental backups are nice to avoid copying all your data again. You can code these at the application layer if you have nice partitioning and keep track correctly. You can also use platform level capabilities such as provided for by the MapR distribution. On Fri, Jan 25, 2013 at 3:23 PM,

Re: Hadoop Scalability

2013-01-18 Thread Ted Dunning
Also, you may have to adjust your algorithms. For instance, the conventional standard algorithm for SVD is a Lanczos iterative algorithm. Iteration in Hadoop is death because of job invocation time ... what you wind up with is an algorithm that will handle big data but with a slow-down factor

Re: Estimating disk space requirements

2013-01-18 Thread Ted Dunning
. And I am pretty sure it does not have a separate partition for root. Please help me explain what u meant and what else precautions should I take. Thanks, Regards, Ouch Whisper 01010101010 On Jan 18, 2013 11:11 PM, Ted Dunning tdunn...@maprtech.com wrote: Where do you find 40gb disks

Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer

2012-12-28 Thread Ted Dunning
Answer B sounds pathologically bad to me. A or C are the only viable options. Neither B nor D work. B fails because it would be extremely hard to get the right records to the right components and because it pollutes data input with configuration data. D fails because statics don't work in

Re: hadoop -put command

2012-12-26 Thread Ted Dunning
The colon is a reserved character in a URI according to RFC 3986[1]. You should be able to percent encode those colons as %3A. [1] http://tools.ietf.org/html/rfc3986 On Wed, Dec 26, 2012 at 1:00 PM, Mohit Anchlia mohitanch...@gmail.comwrote: It looks like hadoop fs -put command doesn't like
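
The suggested fix — percent-encoding the reserved colon as %3A per RFC 3986 — can be done with the standard library. The helper name here is made up for illustration:

```python
from urllib.parse import quote, unquote

def encode_hdfs_path(path):
    """Percent-encode characters reserved in URIs (RFC 3986), such as ':',
    while leaving '/' alone so the path structure survives."""
    return quote(path, safe="/")

encoded = encode_hdfs_path("/data/logs/2012-12-26T13:00:00.log")
```

The encoded form can then be handed to URI-based tools, and `unquote` recovers the original path.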

Re: Merging files

2012-12-22 Thread Ted Dunning
The technical term for this is copying. You may have heard of it. It is a subject of such long technical standing that many do not consider it worthy of detailed documentation. Distcp effects a similar process and can be modified to combine the input files into a single file.

Re: Merging files

2012-12-22 Thread Ted Dunning
) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.tools.DistCp.main(DistCp.java:937) On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning tdunn...@maprtech.comwrote: The technical term for this is copying. You may have heard of it. It is a subject of such long technical

Re: Alerting

2012-12-22 Thread Ted Dunning
You can write a script to parse the Hadoop job list and send an alert. The trick of putting a retry into your workflow system is a nice one. If your program won't allow multiple copies to run at the same time, then if you re-invoke the program every, say, hour, then 5 retries implies that the

Re: Alerting

2012-12-22 Thread Ted Dunning
Also, I think that Oozie allows for timeouts in job submission. That might answer your need. On Sat, Dec 22, 2012 at 2:08 PM, Ted Dunning tdunn...@maprtech.com wrote: You can write a script to parse the Hadoop job list and send an alert. The trick of putting a retry into your workflow

Re: What should I do with a 48-node cluster

2012-12-20 Thread Ted Dunning
On Thu, Dec 20, 2012 at 7:38 AM, Michael Segel michael_se...@hotmail.comwrote: While Ted ignores that the world is going to end before X-Mas, he does hit the crux of the matter head on. If you don't have a place to put it, the cost of setting it up would kill you, not to mention that you can

Re: Sane max storage size for DN

2012-12-12 Thread Ted Dunning
Yes it does make sense, depending on how much compute each byte of data will require on average. With ordinary Hadoop, it is reasonable to have half a dozen 2TB drives. With specialized versions of Hadoop considerably more can be supported. From what you say, it sounds like you are suggesting

Re: bounce message

2012-11-28 Thread Ted Dunning
Also, the moderators don't seem to read anything that goes by. On Wed, Nov 28, 2012 at 4:12 AM, sathyavageeswaran sat...@morisonmenon.comwrote: In this group once anyone subscribes there is no exit route. -Original Message- From: Tony Burton [mailto:tbur...@sportingindex.com] Sent:

Re: Mapping MySQL schema to Avro

2012-11-24 Thread Ted Dunning
On Sat, Nov 24, 2012 at 5:19 AM, Bart Verwilst li...@verwilst.be wrote: ... I'm not sure that i understand your comment about repeating values in fmsswitchvalues, since they are different from the ones in fmssession? I was just pointing out that there were fields in the fmssession record

Re: a question on NameNode

2012-11-19 Thread Ted Dunning
It sounds like you could benefit from reading the basic papers on map-reduce in general. Hadoop is a reasonable facsimile of the original Google systems. Try looking at this: http://research.google.com/archive/mapreduce.html On Mon, Nov 19, 2012 at 7:14 AM, Kartashov, Andy

Re: backup of hdfs data

2012-11-05 Thread Ted Dunning
Conventional enterprise backup systems are rarely scaled for hadoop needs. Both bandwidth and size are typically lacking. My employer, Mapr, offers a hadoop-derived distribution that includes both point in time snapshots and remote mirrors. Contact me off line for more info. Sent from my

Re: ClientProtocol create、mkdirs 、rename and delete methods are not Idempotent

2012-10-28 Thread Ted Dunning
Create cannot be idempotent because of the problem of watches and sequential files. Similarly, mkdirs, rename and delete cannot generally be idempotent. In particular applications, you might find it is OK to treat them as such, but there are definitely applications where they are not idempotent.

Re: Cluster wide atomic operations

2012-10-28 Thread Ted Dunning
On Sun, Oct 28, 2012 at 9:15 PM, David Parks davidpark...@yahoo.com wrote: I need a unique permanent ID assigned to new item encountered, which has a constraint that it is in the range of, let’s say for simple discussion, one to one million. Having such a limited range may require that you

Re: ClientProtocol create、mkdirs 、rename and delete methods are not Idempotent

2012-10-28 Thread Ted Dunning
, I can better understand the problem. 2012/10/29 Ted Dunning tdunn...@maprtech.com Create cannot be idempotent because of the problem of watches and sequential files. Similarly, mkdirs, rename and delete cannot generally be idempotent. In particular applications, you might find it is OK

Re: Cluster wide atomic operations

2012-10-26 Thread Ted Dunning
This is better asked on the Zookeeper lists. The first answer is that global atomic operations are a generally bad idea. The second answer is that if you an batch these operations up then you can cut the evilness of global atomicity by a substantial factor. Are you sure you need a global
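
The batching suggestion can be sketched without any cluster machinery: grab a whole block of IDs with one global atomic bump, then hand out IDs locally. Everything here is a hypothetical stand-in — in a real deployment a ZooKeeper counter would play the role of `self._next`:

```python
import threading

class BlockIdAllocator:
    """Sketch of batched ID allocation. Each worker pays for one 'global'
    atomic operation per block of IDs instead of one per ID, cutting the
    cost of global atomicity by the block size."""
    def __init__(self, block_size=1000, limit=1_000_000):
        self._next = 1          # a ZooKeeper counter in a real cluster
        self._limit = limit     # e.g. the one-to-one-million constraint
        self._block = block_size
        self._lock = threading.Lock()

    def take_block(self):
        with self._lock:  # the single global atomic operation
            if self._next > self._limit:
                raise RuntimeError("ID space exhausted")
            start = self._next
            end = min(start + self._block - 1, self._limit)
            self._next = end + 1
            return iter(range(start, end + 1))
```

A worker calls `take_block()` once and then assigns IDs from the returned range with no further coordination.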

Re: rules engine with Hadoop

2012-10-19 Thread Ted Dunning
Unification in a parallel cluster is a difficult problem. Writing very large scale unification programs is an even harder problem. What problem are you trying to solve? One option would be that you need to evaluate a conventionally-sized rulebase against many inputs. Map-reduce should be

Re: Suitability of HDFS for live file store

2012-10-15 Thread Ted Dunning
If you are going to mention commercial distros, you should include MapR as well. Hadoop compatible, very scalable and handles very large numbers of files in a Posix-ish environment. On Mon, Oct 15, 2012 at 1:35 PM, Brian Bockelman bbock...@cse.unl.eduwrote: Hi, We use HDFS to process data

Re: DFS respond very slow

2012-10-15 Thread Ted Dunning
Uhhh... Alexey, did you really mean that you are running 100 mega bit per second network links? That is going to make hadoop run *really* slowly. Also, putting RAID under any DFS, be it Hadoop or MapR is not a good recipe for performance. Not that it matters if you only have 10 megabytes per

Re: Logistic regression package on Hadoop

2012-10-12 Thread Ted Dunning
Harsh, THanks for the plug. Rajesh has been talking to us. On Fri, Oct 12, 2012 at 8:36 AM, Harsh J ha...@cloudera.com wrote: Hi Rajesh, Please head over to the Apache Mahout project. See https://cwiki.apache.org/MAHOUT/logistic-regression.html Apache Mahout is homed at

Re: Spindle per Cores

2012-10-12 Thread Ted Dunning
It depends on your distribution. Some distributions are more efficient at driving spindles than others. Ratios as high as 2 spindles per core are sometimes quite reasonable. On Fri, Oct 12, 2012 at 10:46 AM, Patai Sangbutsarakum silvianhad...@gmail.com wrote: I have read around about the

Re: Spindle per Cores

2012-10-12 Thread Ted Dunning
I think that this rule of thumb is to prevent people configuring 2 disk clusters with 16 cores or 48 disk machines with 4 cores. Both configurations could make sense in narrow applications, but both would most probably be sub-optimal. Within narrow bands, I doubt you will see huge changes. I

Re: Hadoop/Lucene + Solr architecture suggestions?

2012-10-11 Thread Ted Dunning
by Hadoop. Hi Lance, I'm curious if you've gotten that to work with a decent-sized (e.g. 250 node) cluster? Even a trivial cluster seems to crush SolrCloud from a few months ago at least... Thanks, --tim - Original Message - | From: Ted Dunning tdunn...@maprtech.com

Re: Hadoop/Lucene + Solr architecture suggestions?

2012-10-10 Thread Ted Dunning
I prefer to create indexes in the reducer personally. Also you can avoid the copies if you use an advanced hadoop-derived distro. Email me off list for details. Sent from my iPhone On Oct 9, 2012, at 7:47 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, if I create a Lucene index in

Re: Cumulative value using mapreduce

2012-10-04 Thread Ted Dunning
The answer is really the same. Your problem is just using a goofy representation for negative numbers (after all, negative numbers are a relatively new concept in accounting). You still need to use the account number as the key and the date as a sort key. Many financial institutions also
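
The scheme described — account number as the reduce key, date as the secondary sort key, cumulative balance emitted per record — looks like this when mimicked in plain Python (function name and tuple layout are illustrative):

```python
from collections import defaultdict

def running_balances(transactions):
    """Group by account (the reduce key), sort each group by date (the
    secondary sort key), and emit a cumulative balance per record —
    the same shape as a reducer with secondary sort in map-reduce."""
    groups = defaultdict(list)
    for account, date, amount in transactions:
        groups[account].append((date, amount))
    out = []
    for account, rows in groups.items():
        total = 0
        for date, amount in sorted(rows):  # the framework's sort phase
            total += amount
            out.append((account, date, total))
    return out
```

In real Hadoop the in-memory `sorted(rows)` is what a composite key plus grouping comparator gives you for free during the shuffle.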

Re: HADOOP in Production

2012-10-02 Thread Ted Dunning
On Tue, Oct 2, 2012 at 7:05 PM, Hank Cohen hank.co...@altior.com wrote: There is an important difference between real time and real fast Real time means that system response must meet a fixed schedule. Real fast just means sooner is better. Good thought, but real-time can also include a

Re: splitting jobtracker and namenode

2012-09-26 Thread Ted Dunning
Why are you changing the TTL on DNS if you aren't moving the name? If you are just changing the config to a new name, then caching won't matter. On Wed, Sep 26, 2012 at 1:46 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote: Hi Hadoopers, My production Hadoop 0.20.2 cluster has been

Re: best way to join?

2012-08-31 Thread Ted Dunning
at the begining) this list of N nearest points for every point in the file. Where N is a parameter given to the job. Let's say 10 points. That's it. No calculation after-wards, only querying that list. Thank you On Thu, Aug 30, 2012 at 11:05 PM, Ted Dunning tdunn...@maprtech.comwrote: I

Re: best way to join?

2012-08-30 Thread Ted Dunning
? and calculating the distance as i go On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning tdunn...@maprtech.comwrote: I don't mean that. I mean that a k-means clustering with pretty large clusters is a useful auxiliary data structure for finding nearest neighbors. The basic outline is that you

Re: How to unsubscribe (was Re: unsubscribe)

2012-08-29 Thread Ted Dunning
That was a stupid joke. It wasn't real advice. Have you sent email to the specific email address listed? On Thu, Aug 30, 2012 at 12:35 AM, sathyavageeswaran sat...@morisonmenon.com wrote: I have tried every trick to get self unsubscribed. Yesterday I got a mail saying you can't unsubscribe

Re: How to unsubscribe (was Re: unsubscribe)

2012-08-29 Thread Ted Dunning
...@morisonmenon.com wrote: Of course have sent emails to all permutations and combinations of emails listed with appropriate subject matter. *From:* Ted Dunning [mailto:tdunn...@maprtech.com] *Sent:* 30 August 2012 10:12 *To:* user@hadoop.apache.org *Cc:* Dan Yi; Jay *Subject:* Re

Re: How to unsubscribe (was Re: unsubscribe)

2012-08-29 Thread Ted Dunning
*From:* Ted Dunning [mailto:tdunn...@maprtech.com] *Sent:* 30 August 2012 10:28 *To:* user@hadoop.apache.org *Cc:* Dan Yi; Jay *Subject:* Re: How to unsubscribe (was Re: unsubscribe) Can you say which addresses you sent emails to? The merging of mailing

Re: best way to join?

2012-08-28 Thread Ted Dunning
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan dextermorga...@gmail.comwrote: I understand your solution ( i think) , didn't think of that, in that particular way. I think that lets say i have 1M data-points, and running knn , that the k=1M and n=10 (each point is a cluster that requires up

Re: best way to join?

2012-08-28 Thread Ted Dunning
points? join of a file with it self. Thanks On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning tdunn...@maprtech.comwrote: On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan dextermorga...@gmail.comwrote: I understand your solution ( i think) , didn't think of that, in that particular way. I

Re: best way to join?

2012-08-27 Thread Ted Dunning
Mahout is getting some very fast knn code in version 0.8. The basic work flow is that you would first do a large-scale clustering of the data. Then you would make a second pass using the clustering to facilitate fast search for nearby points. The clustering will require two map-reduce jobs, one
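
The second pass described — use the clustering to restrict the nearest-neighbour search to a few nearby clusters — can be sketched in miniature. The names, the `n_probe` parameter, and the tiny data shapes are all illustrative, not from Mahout:

```python
def nearest_via_clusters(query, centroids, assignments, n_probe=2):
    """Search only the points assigned to the `n_probe` centroids closest
    to the query, instead of scanning every point. `assignments` maps
    centroid index -> list of points in that cluster."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # rank clusters by distance of their centroid to the query
    probe = sorted(range(len(centroids)), key=lambda c: d2(query, centroids[c]))[:n_probe]
    candidates = [p for c in probe for p in assignments[c]]
    return min(candidates, key=lambda p: d2(query, p))
```

With large clusters this turns an O(N) scan into a scan of a few cluster-sized candidate lists, at the cost of occasionally missing a true neighbour that sits just across a cluster boundary (which is why probing more than one cluster helps).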

Re: Storing millions of small files

2012-05-23 Thread Ted Dunning
Mongo has the best out of box experience of anything, but can be limited in terms of how far it will scale. Hbase is a bit tricky to manage if you don't have expertise in managing Hadoop. Neither is a great idea if your data objects can be as large as 10MB. On Wed, May 23, 2012 at 8:30 AM,

Re: Hadoop HA

2012-05-22 Thread Ted Dunning
No. 2.0.0 will not have the same level of ha as MapR. Specifically, the job tracker hasn't been addressed and the name node issues have only been partially addressed. On May 22, 2012, at 8:08 AM, Martinus Martinus martinus...@gmail.com wrote: Hi Todd, Thanks for your answer. Is that will

Re: Hadoop HDFS Backup/Restore Solutions

2012-01-03 Thread Ted Dunning
MapR provides this out of the box in a completely Hadoop compatible environment. Doing this with straight Hadoop involves a fair bit of baling wire. On Tue, Jan 3, 2012 at 1:10 PM, alo alt wget.n...@googlemail.com wrote: Hi Mac, hdfs has at the moment no solution for an complete backup- and

Re: hdfs-nfs - through chokepoint or balanced?

2011-12-16 Thread Ted Dunning
Joey is speaking precisely, but in an intentionally very limited way. Apache HDFS, the file system that comes with Apache Hadoop does not support NFS. On the other hand, maprfs which is a part of the commercial MapR distribution which is based on Apache Hadoop does support NFS natively and

Re: Version control of files present in HDFS

2011-11-21 Thread Ted Dunning
HDFS is a filesystem that is designed to support map-reduce computation. As such, the semantics differ from what SVN or GIT would want to have. HBase provides versioned values. That might suffice for your needs. On Mon, Nov 21, 2011 at 9:58 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Do we

Re: Version control of files present in HDFS

2011-11-21 Thread Ted Dunning
How big is that? On Mon, Nov 21, 2011 at 9:26 PM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi Ted, Well in my case document size can be big, which is not good to keep in Hbase. So I rule out this option. Thanks *From:* Ted Dunning [mailto:tdunn

Re: Version control of files present in HDFS

2011-11-21 Thread Ted Dunning
are going bigger than MBs then it is not good to use Hbase for storage. Any Comments *From:* Ted Dunning [mailto:tdunn...@maprtech.com] *Sent:* Tuesday, November 22, 2011 11:43 AM *To:* hdfs-user@hadoop.apache.org *Subject:* Re: Version control of files present in HDFS

Re: Sizing help

2011-11-11 Thread Ted Dunning
, not 12GB. So about 1-in-72 such failures risks data loss, rather than 1-in-12. Which is still unacceptable, so use 3x replication! :-) --Matt On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning tdunn...@maprtech.com wrote: 3x replication has two effects. One is reliability. This is probably more

Re: dfs.write.packet.size set to 2G

2011-11-08 Thread Ted Dunning
By snapshots, I mean that you can freeze a copy of a portion of the file system for later use as a backup or reference. By mirror, I mean that a snapshot can be transported to another location in the same cluster or to another cluster and the mirrored image will be updated atomically to the

Re: Sizing help

2011-11-08 Thread Ted Dunning
for this usage, however. On Tue, Nov 8, 2011 at 7:32 AM, Rita rmorgan...@gmail.com wrote: Thats a good point. What is hdfs is used as an archive? We dont really use it for mapreduce more for archival purposes. On Mon, Nov 7, 2011 at 7:53 PM, Ted Dunning tdunn...@maprtech.com wrote: 3x

Re: Sizing help

2011-11-07 Thread Ted Dunning
Depending on which distribution and what your data center power limits are you may save a lot of money by going with machines that have 12 x 2 or 3 tb drives. With suitable engineering margins and 3 x replication you can have 5 tb net data per node and 20 nodes per rack. If you want to go all

Re: set reduced block size for a specific file

2011-08-27 Thread Ted Dunning
There is no way to do this for standard Apache Hadoop. But other, otherwise Hadoop compatible, systems such as MapR do support this operation. Rather than push commercial systems on this mailing list, I would simply recommend anybody who is curious to email me. On Sat, Aug 27, 2011 at 12:07 PM,

Re: Running a server on HDFS

2011-07-12 Thread Ted Dunning
HDFS is not a normal file system. Instead, it is highly optimized for running map-reduce. As such, it uses replicated storage but imposes a write-once model on files. This probably makes it unsuitable as a primary storage for VM's. What you need is either a conventional networked storage device or if

Re: Poor IO performance on a 10 node cluster.

2011-06-01 Thread Ted Dunning
It is also worth using dd to verify your raw disk speeds. Also, expressing disk transfer rates in bytes per second makes it a bit easier for most of the disk people I know to figure out what is large or small. Each of these disks should do about 100MB/s when driven well. Hadoop does OK,

Re: trying to select technology

2011-05-31 Thread Ted Dunning
To pile on, thousands or millions of documents are well within the range that is well addressed by Lucene. Solr may be an even better option than bare Lucene since it handles lots of the boilerplate problems like document parsing and index update scheduling. On Tue, May 31, 2011 at 11:56 AM,

Re: Simple change to WordCount either times out or runs 18+ hrs with little progress

2011-05-24 Thread Ted Dunning
itr.nextToken() is inside the if. On Tue, May 24, 2011 at 7:29 AM, maryanne.dellasa...@gdc4s.com wrote: while (itr.hasMoreTokens()) { if(count == 5) { word.set(itr.nextToken()); output.collect(word, one); } count++; }
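
The one-line diagnosis is that the tokenizer only advances inside the `if`, so whenever `count != 5` the loop never makes progress. The fix is to consume a token on every iteration and only *collect* the fifth one. A Python analogue of the corrected logic (the thread's code is Java; this sketch just mirrors the control flow):

```python
def fifth_token_counts(lines):
    """Count occurrences of the 5th token of each line. The token stream
    is advanced on every loop iteration; advancing it only when
    count == 5 (the bug in the original) would stall forever."""
    counts = {}
    for line in lines:
        count = 0
        for tok in line.split():   # always consume the next token
            count += 1
            if count == 5:         # ...but only collect the fifth one
                counts[tok] = counts.get(tok, 0) + 1
                break
    return counts
```

Lines with fewer than five tokens simply contribute nothing, rather than hanging the task until it times out.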

Re: Hadoop and WikiLeaks

2011-05-19 Thread Ted Dunning
ZK started as sub-project of Hadoop. On Thu, May 19, 2011 at 7:27 AM, M. C. Srivas mcsri...@gmail.com wrote: Interesting to note that Cassandra and ZK are now considered Hadoop projects. There were independent of Hadoop before the recent update. On Thu, May 19, 2011 at 4:18 AM, Steve

Re: matrix-vector multiply in hadoop

2011-05-17 Thread Ted Dunning
Try using the Apache Mahout code that solves exactly this problem. Mahout has a distributed row-wise matrix that is read one row at a time. Dot products with the vector are computed and the results are collected. This capability is used extensively in the large scale SVD's in Mahout. On Tue,
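
The row-wise scheme described — read one matrix row at a time, dot it with the shared vector, collect the results — reduces to a few lines when the map phase is simulated with Python's `map` (the function name is illustrative, not Mahout's API):

```python
def distributed_matvec(matrix_rows, v):
    """Each row is processed independently (the map step) and produces one
    dot product with the shared vector v; collecting the (index, value)
    pairs back into order stands in for the reduce/collect step."""
    def mapper(indexed_row):
        i, row = indexed_row
        return (i, sum(a * b for a, b in zip(row, v)))
    results = dict(map(mapper, enumerate(matrix_rows)))
    return [results[i] for i in range(len(matrix_rows))]
```

Because each row's dot product is independent, the rows can be split across any number of mappers with the vector shipped to all of them — the same structure the large-scale SVD uses.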

Re: Suggestions for swapping issue

2011-05-11 Thread Ted Dunning
How is it that 36 processes are not expected if you have configured 48 + 12 = 60 slots available on the machine? On Wed, May 11, 2011 at 11:11 AM, Adi adi.pan...@gmail.com wrote: By our calculations hadoop should not exceed 70% of memory. Allocated per node - 48 map slots (24 GB) , 12 reduce

Re: questions about hadoop map reduce and compute intensive related applications

2011-04-30 Thread Ted Dunning
On Sat, Apr 30, 2011 at 12:18 AM, elton sky eltonsky9...@gmail.com wrote: I got 2 questions: 1. I am wondering how hadoop MR performs when it runs compute intensive applications, e.g. Monte carlo method compute PI. There's a example in 0.21, QuasiMonteCarlo, but that example doesn't use

Re: Serving Media Streaming

2011-04-30 Thread Ted Dunning
Check out S4 http://s4.io/ On Fri, Apr 29, 2011 at 10:13 PM, Luiz Fernando Figueiredo luiz.figueir...@auctorita.com.br wrote: Hi guys. Hadoop is well known to process large amounts of data but we think that we can do much more than it. Our goal is try to serve pseudo-streaming near of

Re: Applications creates bigger output than input?

2011-04-30 Thread Ted Dunning
Cooccurrence analysis is commonly used in recommendations. These produce large intermediates. Come on over to the Mahout project if you would like to talk to a bunch of people who work on these problems. On Fri, Apr 29, 2011 at 9:31 PM, elton sky eltonsky9...@gmail.com wrote: Thank you for

Re: providing the same input to more than one Map task

2011-04-22 Thread Ted Dunning
I would recommend taking this question to the Mahout mailing list. The short answer is that matrix multiplication by a column vector is pretty easy. Each mapper reads the vector in the configure method and then does a dot product for each row of the input matrix. Results are reassembled into a

Re: Estimating Time required to compute M/Rjob

2011-04-17 Thread Ted Dunning
Turing completion isn't the central question here, really. The truth is, map-reduce programs have considerable pressure to be written in a scalable fashion which limits them to fairly simple behaviors that result in pretty linear dependence of run-time on input size for a given program. The cool

Re: Estimating Time required to compute M/Rjob

2011-04-16 Thread Ted Dunning
Sounds like this paper might help you: Predicting Multiple Performance Metrics for Queries: Better Decisions Enabled by Machine Learning by Ganapathi, Archana, Harumi Kuno, Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, David Patterson http://radlab.cs.berkeley.edu/publication/187

Re: Dynamic Data Sets

2011-04-13 Thread Ted Dunning
Hbase is very good at this kind of thing. Depending on your aggregation needs OpenTSDB might be interesting since they store and query against large amounts of time ordered data similar to what you want to do. It isn't clear whether your data is primarily about current state or about

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
nothing architecture. This may be more database terminology that could be addressed by hbase, but I think it is good background for the questions of memory mapping files in hadoop. Kevin -Original Message- From: Ted Dunning [mailto:tdunn...@maprtech.com] Sent: Tuesday, April 12, 2011

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
, Jason Rutherglen jason.rutherg...@gmail.com wrote: Then one could MMap the blocks pertaining to the HDFS file and piece them together. Lucene's MMapDirectory implementation does just this to avoid an obscure JVM bug. On Mon, Apr 11, 2011 at 9:09 PM, Ted Dunning tdunn...@maprtech.com wrote

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Blocks live where they land when first created. They can be moved due to node failure or rebalancing, but it is typically pretty expensive to do this. It certainly is slower than just reading the file. If you really, really want mmap to work, then you need to set up some native code that builds

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Actually, it doesn't become trivial. It just becomes total fail or total win instead of almost always being partial win. It doesn't meet Benson's need. On Tue, Apr 12, 2011 at 11:09 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: To get around the chunks or blocks problem, I've been

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Benson is actually a pretty sophisticated guy who knows a lot about mmap. I engaged with him yesterday on this since I know him from Apache. On Tue, Apr 12, 2011 at 7:16 PM, M. C. Srivas mcsri...@gmail.com wrote: I am not sure if you realize, but HDFS is not VM integrated.

Re: Using global reverse lookup tables

2011-04-11 Thread Ted Dunning
Depending on the function that you want to use, it sounds like you want to use a self join to compute transposed cooccurrence. That is, it sounds like you want to find all the sets that share elements with X. If you have a binary matrix A that represents your set membership with one row per set
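
The A·Aᵀ formulation can be made concrete: the sets sharing elements with set X are exactly the nonzero entries of row x of A·Aᵀ, where A is the binary set-membership matrix. The sketch below computes that row via an inverted element-to-sets index, which is the self join in disguise (names are illustrative):

```python
from collections import defaultdict

def overlapping_sets(sets, x):
    """Return every set sharing at least one element with sets[x], mapped
    to its overlap count — the nonzero entries of row x of A @ A.T for
    the binary membership matrix A. Implemented as a self join through
    an inverted element -> sets index."""
    index = defaultdict(list)
    for name, members in sets.items():
        for m in members:
            index[m].append(name)
    counts = defaultdict(int)
    for m in sets[x]:           # walk row x of A
        for name in index[m]:   # join against the matching column of A.T
            if name != x:
                counts[name] += 1
    return dict(counts)
```

In map-reduce terms the inverted index is the shuffle by element, and the per-element fan-out is where the cooccurrence pairs are generated.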

Re: Memory mapped resources

2011-04-11 Thread Ted Dunning
Also, it only provides access to a local chunk of a file which isn't very useful. On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.comwrote: On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Yes you can however it will require customization
