Re: can block size for namenode be different from datanode block size?

2015-03-25 Thread Mirko Kämpf
> ...block size include overhead for three replicas? So with 128MB blocks a 1GB file will be 8 blocks, with 200 + 8x200, around 1800 bytes of memory in the namenode? > Thx
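For reference, the arithmetic behind that estimate (a rough sketch; the commonly quoted figure is on the order of 150-200 bytes of NameNode heap per namespace object): 1 GB / 128 MB = 8 blocks, so 1 file object + 8 block objects ≈ 9 x 200 ≈ 1800 bytes. The three replicas do not triple that count; they mainly add per-replica location entries in the NameNode's block map.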

Re: can block size for namenode be different from datanode block size?

2015-03-25 Thread Mirko Kämpf
> ...cache cannot hold all data. Is this the case with metadata as well? > Regards, > Mich

Re: can block size for namenode be different from datanode block size?

2015-03-25 Thread Mirko Kämpf
Hi Mich, please see the comments in your text. 2015-03-25 15:11 GMT+00:00 Dr Mich Talebzadeh: > Hi, > The block size for HDFS is currently set to 128MB by default. This is > configurable. Correct, an HDFS client can override the config property and define a different block size for HDFS
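For illustration, a minimal Java sketch of the two usual ways a client requests a non-default block size (the path and the 256 MB value are only examples; the property is dfs.blocksize in Hadoop 2.x, dfs.block.size in older releases):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CustomBlockSizeWrite {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Option 1: override the client-side default for everything written with this conf
      conf.setLong("dfs.blocksize", 256L * 1024 * 1024);   // 256 MB

      FileSystem fs = FileSystem.get(conf);
      Path out = new Path("/user/demo/bigfile.dat");       // hypothetical path

      // Option 2: pass an explicit block size for this single file
      FSDataOutputStream stream = fs.create(
          out,
          true,                                            // overwrite
          conf.getInt("io.file.buffer.size", 4096),        // buffer size
          (short) 3,                                       // replication factor
          256L * 1024 * 1024);                             // block size in bytes
      stream.writeUTF("hello");
      stream.close();
    }
  }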

Re: HDFS - many files, small size

2014-10-02 Thread Mirko Kämpf
Hi Roger, you can use Apache Flume to ingest these files into your cluster. Store them in an HBase table for fast random access and extract the "metadata" on the fly using morphlines (see: http://kitesdk.org/docs/0.11.0/kite-morphlines/index.html). Even the base64 conversion can be done on the fly i
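A minimal sketch of the "decode base64 on the fly before storing in HBase" idea, assuming the pre-1.0 HBase client API and commons-codec; the table, column family, and row key are hypothetical:

  import org.apache.commons.codec.binary.Base64;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class Base64ToHBase {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "files");               // hypothetical table
      String rowKey = "doc-0001";                             // hypothetical row key
      String base64Payload = args[0];                         // base64-encoded file content

      byte[] decoded = Base64.decodeBase64(base64Payload);    // decode on the fly
      Put put = new Put(Bytes.toBytes(rowKey));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("content"), decoded);
      table.put(put);
      table.close();
    }
  }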

Re: Re-sampling time data with MR job. Ideas

2014-09-19 Thread Mirko Kämpf
get different entity_id's and based on the input split. > Georgi > On 19.09.2014 10:34, Mirko Kämpf wrote: > Hi Georgi, I would already emit the new time stamp (with a resolution of 10 min) in the > mapper. This allows you to (pre)aggregate the d

Re: Re-sampling time data with MR job. Ideas

2014-09-19 Thread Mirko Kämpf
Hi Georgi, I would already emit the new time stamp (with a resolution of 10 min) in the mapper. This allows you to (pre)aggregate the data in the mapper, so you have less traffic during the shuffle & sort stage. Changing the resolution means you have to aggregate the individual entities or do y
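A minimal mapper sketch of that pre-aggregation key, assuming a simple comma-separated input of entity id, timestamp in milliseconds, and a measurement (field layout and class name are hypothetical):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Rounds each record's timestamp down to a 10-minute bucket and emits
  // (entityId, bucket) as the key, so aggregation happens per entity and bucket.
  public class ResampleMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final long BUCKET_MS = 10L * 60 * 1000;

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");   // entityId,timestampMillis,measurement
      long ts = Long.parseLong(fields[1]);
      long bucket = ts - (ts % BUCKET_MS);              // the new, coarser timestamp
      context.write(new Text(fields[0] + "\t" + bucket), new Text(fields[2]));
    }
  }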

Re: YARN Application Master Question

2014-08-06 Thread Mirko Kämpf
In this case you would have 3 AMs for the MR jobs and 2 more AMs, one for each Giraph job. That makes a total of 5 AMs. Cheers, Mirko 2014-08-06 11:57 GMT+01:00 Ana Gillan: > Hi, > In the documentation and papers about Apache YARN, they say that an > Application Master is launched for every applicati

Re: Decommissioning a data node and problems bringing it back online

2014-07-24 Thread Mirko Kämpf
After you have added the nodes back to your cluster you can run the balancer tool, but it will not bring back exactly the same blocks as before. Cheers, Mirko 2014-07-24 17:34 GMT+01:00 andrew touchet: > Thanks for the reply, > I am using Hadoop-0.20. We installed from Apache, not Cloudera, if that
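For reference, the balancer is started from the command line (hadoop balancer -threshold 10 on 0.20-era releases, hdfs balancer -threshold 10 on newer ones); the threshold is the allowed deviation of per-node disk utilization from the cluster average, in percent, so it only evens out utilization and does not restore specific blocks to specific nodes.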

Re: Need to evaluate a cluster

2014-07-10 Thread Mirko Kämpf
...server? > Knowing that, I prefer the LOW-COST option. > What could be the price of a LOW-COST server with 12 HDDs (3TB)? > Regards

Re: Need to evaluate a cluster

2014-07-10 Thread Mirko Kämpf
binary data like images, mp3 or video formats? Cheers, Mirko 2014-07-10 10:43 GMT+02:00 YIMEN YIMGA Gael: > Hi, > What does « 1.3 for overhead » mean in this calculation? > Regards

Re: Need to evaluate a cluster

2014-07-09 Thread Mirko Kämpf
Hello, if I follow your numbers I see one missing fact: *What is the number of HDDs per DataNode*? If we assume machines with 6 x 3TB HDDs per box, you would need about 60 DataNodes per year (0.75 TB per day x 3 for replication x 1.3 for overhead / ( nr of HDDs per node x capacity per HDD
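Plugging the example numbers into that formula (a rough sketch that ignores OS and temp space on the nodes): 0.75 TB/day x 3 (replication) x 1.3 (overhead) ≈ 2.9 TB of raw capacity per day; 2.9 TB/day x 365 days ≈ 1,070 TB per year; 1,070 TB / (6 HDDs x 3 TB = 18 TB per node) ≈ 60 DataNodes per year.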

Re: HBase metadata

2014-07-08 Thread Mirko Kämpf
Hi John, I suggest the project: http://www.kiji.org/ or even the brand new: http://phoenix.apache.org/ Cheers, Mirko 2014-07-08 16:05 GMT+00:00 John Lilley : > Greetings! > > > > We would like to support HBase in a general manner, having our software > connect to any HBase table and read/wr

Re: partition file by content based through HDFS

2014-05-11 Thread Mirko Kämpf
Hi, HDFS blocks are not "content aware". A separation like the one you requested could be done via Hive or Pig with a few lines of code; then you would have multiple files which can be organized in partitions as well, but such partitions are on a different abstraction level, not on blocks, but within

Re: I am about to lose all my data please help

2014-03-16 Thread Mirko Kämpf
Hi, what is the location of the namenode's fsimage and edit logs? And how much memory does the NameNode have? Did you work with a Secondary NameNode or a Standby NameNode for checkpointing? Where are your HDFS blocks located, and are those still safe? With this information at hand, one might be able to fix

Re: Add few record(s) to a Hive table or a HDFS file on a daily basis

2014-02-10 Thread Mirko Kämpf
Hi Raj, there is no way of adding new data to a file in HDFS as long as the append functionality is not available. Adding new "records" to a Hive table means creating a new file with those records. You do this in the "staging" table, which might be inefficient for large data sets, especially if you

Re:

2013-12-13 Thread Mirko Kämpf
...how to write MapReduce jobs, so can you please suggest any > website links or send any notes to me. > Thank you for your answer; you cleared one doubt. > With regards, > chandu. > On Thu, Dec 12, 2013 at 7:05 PM, Mirko Kämpf wrote:

Re:

2013-12-12 Thread Mirko Kämpf
The procedure of splitting the larger file into blocks is handled by the client. It delivers each block to a DataNode (this can be a different one for each block, but does not have to be; in a pseudo-distributed cluster, for example, we have only one node). Replication of the blocks is handled in the cluster by

Re: Why is Hadoop always running just 4 tasks?

2013-12-11 Thread Mirko Kämpf
Hi, what is the command you execute to submit the job? Please also share the driver code so we can troubleshoot better. Best wishes Mirko 2013/12/11 Dror, Ittay: > I have a cluster of 4 machines with 24 cores and 7 disks each. > On each node I copied a 500G file from local. So I h

Re: Execute hadoop job remotely and programmatically

2013-12-10 Thread Mirko Kämpf
Hi Yexi, please have a look at the -libjars option of the hadoop command. It tells the system what additional libs have to be sent to the cluster before the job can start. Each time you submit the job, this kind of distribution happens again. So it's not a good idea for really large libs; those you sho
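For illustration, a minimal driver sketch (class and job names are hypothetical, Hadoop 2 MapReduce API assumed) showing why -libjars only takes effect when the driver goes through ToolRunner, since it is GenericOptionsParser that interprets the option:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MyJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
      Job job = Job.getInstance(getConf(), "my-job");
      job.setJarByClass(MyJobDriver.class);
      // ... set input/output paths, mapper, reducer here ...
      return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
      // e.g. hadoop jar myjob.jar MyJobDriver -libjars dep1.jar,dep2.jar in out
      // ToolRunner applies GenericOptionsParser, which ships the -libjars files
      // to the cluster and puts them on the task classpath.
      System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
  }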

Re: XML parsing in Hadoop

2013-11-27 Thread Mirko Kämpf
Chhaya, did you run the same code in standalone mode, without the MapReduce framework? How long does the code in your map() function take standalone? Compare those two times (t_0 MR mode, t_1 standalone mode) to find out if it is an MR issue or something which comes from the xml-parser logic or th

Re: Difference between clustering and classification in hadoop

2013-11-22 Thread Mirko Kämpf
... it depends on the implementation. ;-) Mahout offers both: "Mahout in Action" http://manning.com/owen/ And there is more: http://en.wikipedia.org/wiki/Cluster_analysis http://en.wikipedia.org/wiki/Statistical_classification Good luck! Mirko

Re: SVM implementation

2013-11-07 Thread Mirko Kämpf
There are three Jira issues about it: Mahout-14 and Mahout-334 have been closed as won't fix. Mahout-232 was assigned later because some code was co

Re: how to schedule the hadoop commands in a cron job or any other way?

2013-11-07 Thread Mirko Kämpf
...thanks in advance. > On Thu, Nov 7, 2013 at 5:48 PM, Mirko Kämpf wrote: >> Hi, >> it is easy to create a shell script which contains the hadoop jar ... command >> with all its parameters. And this command can be used in your cron job.

Re: how to schedule the hadoop commands in a cron job or any other way?

2013-11-07 Thread Mirko Kämpf
Hi, it is easy to create a shell script which contains the hadoop jar ... command with all its parameters. And this command can be used in your cron job. But are you sure you are talking about scheduling? Is it job submission you mean? Scheduling in Hadoop is more related to tasks. Have a loo
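A minimal sketch of that approach (paths, driver class, and schedule are hypothetical):

  #!/bin/sh
  # /home/hadoop/bin/run-myjob.sh -- wraps the hadoop jar call with all its parameters
  hadoop jar /home/hadoop/jobs/myjob.jar com.example.MyJobDriver \
      /data/input /data/output/$(date +%Y%m%d)

  # crontab entry: submit the job every night at 02:00
  0 2 * * * /home/hadoop/bin/run-myjob.sh >> /var/log/myjob.log 2>&1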

Re: Volunteer

2013-11-06 Thread Mirko Kämpf
What is the field you want to work in: core hadoop development, scripting and testing, documentation, tool development, app development, benchmarking? What is your level of experience? What programming languages do you use? I think you can just start with building hadoop and its related project

Re: compareTo() in WritableComparable

2013-11-04 Thread Mirko Kämpf
You just have to implement the WritableComparable interface. This might be straightforward for any simple data type. Have a look at: public int compareTo(MyWritableComparable o) { int thisValue = this.value; int thatValue = o.value; return (thisValue < thatValue ? -1 : (thisValue
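Completing that idea, a minimal self-contained sketch of a WritableComparable with a single int field (class and field names are just examples):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.WritableComparable;

  public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
    private int value;

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeInt(value);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      value = in.readInt();
    }

    @Override
    public int compareTo(MyWritableComparable o) {
      int thisValue = this.value;
      int thatValue = o.value;
      return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }

    @Override
    public int hashCode() {
      return value;   // keys used with HashPartitioner should also override hashCode
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof MyWritableComparable
          && ((MyWritableComparable) other).value == value;
    }
  }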

Re: best solution for data ingestion

2013-11-01 Thread Mirko Kämpf
Have a look at Sqoop for data from an RDBMS, or at Flume if data flows and multiple sources have to be handled. Best wishes Mirko 2013/11/1 Siddharth Tiwari: > hi team > seeking your advice on what could be the best way to ingest a lot of data into > hadoop. Also, what are your views about FUSE?

Re: Issues in emitting 2D double array to Reducer.

2013-10-28 Thread Mirko Kämpf
Hi, you should create a custom data type which contains two DoubleArrayWritable instances. This custom data type idea is also explained here: http://my.safaribooksonline.com/book/-/9781849517287/4dot-developing-complex-hadoop-mapreduce-applications/ch04s03_html Good luck. Mirko 2013/10
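A minimal sketch of such a custom type (class names are hypothetical; DoubleArrayWritable is assumed to be the usual ArrayWritable subclass fixed to DoubleWritable elements):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.ArrayWritable;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.Writable;

  // Carries two double arrays (e.g. two matrix rows) as one MapReduce value.
  public class TwoDoubleArraysWritable implements Writable {

    public static class DoubleArrayWritable extends ArrayWritable {
      public DoubleArrayWritable() {
        super(DoubleWritable.class);   // so readFields() knows the element type
      }
    }

    private final DoubleArrayWritable first = new DoubleArrayWritable();
    private final DoubleArrayWritable second = new DoubleArrayWritable();

    public DoubleArrayWritable getFirst()  { return first; }
    public DoubleArrayWritable getSecond() { return second; }

    @Override
    public void write(DataOutput out) throws IOException {
      first.write(out);
      second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      first.readFields(in);
      second.readFields(in);
    }
  }

The arrays would be filled via first.set(new DoubleWritable[]{...}) before the object is emitted.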

Re: issue about hadoop hardware choose

2013-08-08 Thread Mirko Kämpf
Hello Ch Huang, do you know this book? "Hadoop Operations" http://shop.oreilly.com/product/0636920025085.do I think it answers most of the questions in detail. For a production cluster you should consider MRv1. And I suggest going with more hard drives per slave node to have a higher IO b

Re: What's the best disk configuration for hadoop? SSDs, RAID levels, etc?

2013-05-11 Thread Mirko Kämpf
Hi David, on your first point: the rule of thumb is "one disk per CPU (or per 1.5 to 2 CPUs)". In your case more parallel IO would be possible with more disks, but as you wrote, you have less IO-bound processing. Things might change, and an SSD could speed up the shuffle & sort phase, but I suggest to d

Re: Saving data in db instead of hdfs

2013-05-02 Thread Mirko Kämpf
Hi, just use Sqoop to push the data from HDFS to a database via JDBC. Intro to Sqoop: http://blog.cloudera.com/blog/2009/06/introducing-sqoop/ Or even use Hive-JDBC to connect to your result data from outside the hadoop cluster. You can also create your own OutputFormat (with Java API), which w

Re: Multiple mappers, different parameter values

2013-04-02 Thread Mirko Kämpf
Hi, I would add an id to the parameter name: "num.iterations.ID=something". If your mapper knows which ID it has, it can just pick up this value from the context. But the question is: how does the mapper know about its ID? Is it related to the input? Then it can be calculated, but this is a domain-s
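A minimal sketch of the "derive the ID from the input" variant, assuming each mapper reads one file and the file name serves as the ID (property scheme and class name are hypothetical):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;

  // Reads the per-ID parameter from the job configuration, keyed by the name
  // of the file backing this mapper's input split.
  public class PerSplitParameterMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int numIterations;

    @Override
    protected void setup(Context context) {
      FileSplit split = (FileSplit) context.getInputSplit();
      String id = split.getPath().getName();   // e.g. "part-A", "part-B", ...
      numIterations = context.getConfiguration()
          .getInt("num.iterations." + id, 1);  // falls back to 1 if not set
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // ... use numIterations here ...
    }
  }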

Re: each stage's time in hadoop

2013-03-06 Thread Mirko Kämpf
Hi, please have a look at the "Starfish" project: http://www.cs.duke.edu/starfish/ Best wishes Mirko 2013/3/6 claytonly: > Hello, all > I was using hadoop-1.0.0 on Ubuntu 12.04. I was wondering how I can > know each stage's running time in MapReduce. I got some information fr

Re: Estimating disk space requirements

2013-01-18 Thread Mirko Kämpf
Hi, some comments are inside your message ... 2013/1/18 Panshul Whisper: > Hello, > I was estimating how much disk space I need for my cluster. > I have 24 million JSON documents, approx. 5 KB each. The JSON is to be stored in HBase with some identifying data in columns, and I also wa
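As a rough back-of-the-envelope sketch from those numbers, before any HBase storage overhead (keys, versions, HFile indexes) and temp/OS space: 24,000,000 x 5 KB ≈ 120 GB of raw JSON, and with 3-fold HDFS replication ≈ 360 GB.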

Re: How to copy log files from remote windows machine to Hadoop cluster

2013-01-17 Thread Mirko Kämpf
>>> ...can push the latest updates into the cluster, or whether I have to simply push once a day to the cluster using the spooling-directory mechanism. >>> Can somebody advise whether it is possible using Flume and, if so, the configurations needed for

Re: queues in hadoop

2013-01-11 Thread Mirko Kämpf
I would suggest working with Flume in order to collect a certain number of files and store them to HDFS in larger chunks, or write them directly to HBase; this allows random access later on (if needed), otherwise HBase could be overkill. You can also collect data in a MySQL DB and then import it regularly via

Re: best way to join?

2012-09-10 Thread Mirko Kämpf
Hi Dexter, I am not sure if I understood your requirements right, so I repeat them to define a starting point. 1.) You have a (static) list of points (the points.txt file). 2.) Now you want to calculate the nearest points to a set of given points. Are the points which have to be considered in a diffe