Re: Test

2014-05-11 Thread Azuryy
Got it.

But it doesn't indicate that everyone can receive this test.

The mailing list has been unstable recently.


Sent from my iPhone5s

 On May 10, 2014, at 13:31, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 This message has no content.


Re: Spark output compression on HDFS

2014-04-04 Thread Azuryy
There is no compression type for Snappy.


Sent from my iPhone5s

 On April 4, 2014, at 23:06, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 
 Can anybody suggest how to change the compression type (Record, Block) for 
 Snappy? If it is possible, of course.
 
 thank you in advance
 
 Thank you,
 Konstantin Kudryavtsev
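
The Record/Block distinction is a SequenceFile setting rather than a Snappy one, so one way to reach it is through the Hadoop output-format API. A minimal sketch, assuming counts is an existing RDD[(IntWritable, Text)] as in the thread below, with an invented output path:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.io.SequenceFile.CompressionType
    import org.apache.hadoop.io.compress.SnappyCodec
    import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, SequenceFileOutputFormat}
    import org.apache.spark.SparkContext._

    // counts: RDD[(IntWritable, Text)] is assumed to exist already.
    val jobConf = new JobConf(counts.context.hadoopConfiguration)
    FileOutputFormat.setCompressOutput(jobConf, true)
    FileOutputFormat.setOutputCompressorClass(jobConf, classOf[SnappyCodec])
    // The RECORD/BLOCK knob lives on the SequenceFile output format.
    SequenceFileOutputFormat.setOutputCompressionType(jobConf, CompressionType.BLOCK)

    counts.saveAsHadoopFile(
      "output",                                              // hypothetical path
      classOf[IntWritable],
      classOf[Text],
      classOf[SequenceFileOutputFormat[IntWritable, Text]],
      jobConf)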
 
 
 On Thu, Apr 3, 2014 at 10:28 PM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 Thanks all, it works fine now and I managed to compress the output. However, I 
 am still stuck... How is it possible to set the compression type for Snappy? 
 I mean setting record- or block-level compression for the output.
 
 On Apr 3, 2014 1:15 AM, Nicholas Chammas nicholas.cham...@gmail.com 
 wrote:
 Thanks for pointing that out.
 
 
 On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra m...@clearstorydata.com 
 wrote:
 First, you shouldn't be using spark.incubator.apache.org anymore, just 
 spark.apache.org.  Second, saveAsSequenceFile doesn't appear to exist in 
 the Python API at this point. 
 
 
 On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
 Is this a Scala-only feature?
 
 
 On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com 
 wrote:
 For textFile I believe we overload it and let you set a codec directly:
 
 https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
 
 For saveAsSequenceFile yep, I think Mark is right, you need an option.
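
A small sketch of setting a codec when saving text output (sc, the sample data, and the path are made up for illustration):

    import org.apache.hadoop.io.compress.GzipCodec

    // saveAsTextFile accepts a codec class directly.
    val lines = sc.parallelize(Seq("alpha", "beta", "gamma"))
    lines.saveAsTextFile("text-output-gz", classOf[GzipCodec])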
 
 
 On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra m...@clearstorydata.com 
 wrote:
 http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
 
 The signature is 'def saveAsSequenceFile(path: String, codec: 
 Option[Class[_ <: CompressionCodec]] = None)', but you are providing a 
 Class, not an Option[Class].  
 
 Try counts.saveAsSequenceFile(output, 
 Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
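
A self-contained sketch of that suggestion, with made-up sample data and path (sc is an existing SparkContext):

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.io.compress.SnappyCodec
    import org.apache.spark.SparkContext._

    // Build a small (IntWritable, Text) pair RDD and save it as a
    // Snappy-compressed SequenceFile.
    val counts = sc.parallelize(Seq(1 -> "one", 2 -> "two"))
      .map { case (k, v) => (new IntWritable(k), new Text(v)) }

    counts.saveAsSequenceFile("output", Some(classOf[SnappyCodec]))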
 
 
 
 On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 Hi there,
 
 I've started using Spark recently and am evaluating possible use cases in 
 our company. 
 
 I'm trying to save an RDD as a compressed SequenceFile. I'm able to save a 
 non-compressed file by calling:
 
 counts.saveAsSequenceFile(output)
 where counts is my RDD of (IntWritable, Text). However, I didn't manage 
 to compress the output. I tried several configurations and always got an 
 exception:
 
 counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
 <console>:21: error: type mismatch;
  found   : Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
        counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
 
 counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
 <console>:21: error: type mismatch;
  found   : Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
        counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
 and it doesn't work even for Gzip:
 
 counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
 <console>:21: error: type mismatch;
  found   : Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
        counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
 Could you please suggest a solution? Also, I didn't find how it is 
 possible to specify compression parameters (i.e. the compression type for 
 Snappy). Could you share code snippets for writing/reading an RDD with 
 compression? 
 
 Thank you in advance,
 
 Konstantin Kudryavtsev
 


Re: Relation between DStream and RDDs

2014-03-21 Thread Azuryy
Thanks for sharing here.

Sent from my iPhone5s

 On March 21, 2014, at 20:44, Sanjay Awatramani sanjay_a...@yahoo.com wrote:
 
 Hi,
 
 I searched more articles and ran a few examples, and have clarified my doubts. 
 This answer by TD in another thread ( 
 https://groups.google.com/d/msg/spark-users/GQoxJHAAtX4/0kiRX0nm1xsJ ) helped 
 me a lot.
 
 Here is the summary of my findings:
 1) A DStream can consist of 0, 1, or more RDDs.
 2) Even if you have multiple files to be read in a time interval, the DStream 
 will have only 1 RDD.
 3) Functions like reduce & count return as many RDDs as there were in the 
 input DStream. Since the internal computation in every batch has only 1 RDD, 
 these functions will return 1 RDD in the returned DStream. However, if you 
 are using window functions to get more RDDs and run reduce/count on the 
 windowed DStream, your returned DStream will have more than 1 RDD.
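
A minimal sketch of point 3), showing count producing one single-element RDD per batch (the master, batch duration, and input directory are invented):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DStreamCountSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(60))

    val lines = ssc.textFileStream("hdfs:///tmp/stream-in")   // hypothetical input dir
    val counts = lines.count()   // one single-element RDD per batch

    counts.foreachRDD { rdd =>
      // Each batch yields exactly one RDD here, holding that batch's count.
      println(s"records in this batch: ${rdd.collect().mkString}")
    }

    ssc.start()
    ssc.awaitTermination()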
 
 Hope this helps someone.
 Thanks everyone for the answers.
 
 Regards,
 Sanjay
 
 
 On Thursday, 20 March 2014 9:30 PM, andy petrella andy.petre...@gmail.com 
 wrote:
 I don't see an example, but conceptually it looks like you'll need a suitable 
 structure like a Monoid. If it's not tied to a window, it's an overall 
 computation that has to grow over time (otherwise it would land in the batch 
 world, see below), and that is the purpose of the Monoid, especially 
 probabilistic sets (to avoid consuming all the memory).
 
 If it falls into the batch job's world because you have enough information 
 encapsulated in one conceptual RDD, it might be helpful to have the DStream 
 store it in HDFS, then use the SparkContext within the StreamingContext to 
 run a batch job on the data.
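
A rough sketch of that idea, with invented paths and durations:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("StreamThenBatch").setMaster("local[2]"), Seconds(60))
    val events = ssc.textFileStream("hdfs:///tmp/events-in")

    // Persist each batch to HDFS; one directory per batch interval.
    events.saveAsTextFiles("hdfs:///tmp/events-archive/part")

    events.foreachRDD { _ =>
      // Run an ordinary batch pass over everything archived so far,
      // using the SparkContext that backs the StreamingContext.
      val total = ssc.sparkContext.textFile("hdfs:///tmp/events-archive/part*").count()
      println(s"archived records so far: $total")
    }

    ssc.start()
    ssc.awaitTermination()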
 
 But I'm only thinking out loud, so I might be completely wrong.
 
 hth
 
 Andy Petrella
 Belgium (Liège)

  Data Engineer in NextLab sprl (owner)
  Engaged Citizen Coder for WAJUG (co-founder)
  Author of Learning Play! Framework 2
  Bio: on visify

 Mobile: +32 495 99 11 04
 Mails:  
 andy.petre...@nextlab.be
 andy.petre...@gmail.com

 Socials:
 Twitter: https://twitter.com/#!/noootsab
 LinkedIn: http://be.linkedin.com/in/andypetrella
 Blogger: http://ska-la.blogspot.com/
 GitHub:  https://github.com/andypetrella
 Masterbranch: https://masterbranch.com/andy.petrella
 
 
 On Thu, Mar 20, 2014 at 12:18 PM, Pascal Voitot Dev 
 pascal.voitot@gmail.com wrote:
 
 
 
 On Thu, Mar 20, 2014 at 11:57 AM, andy petrella andy.petre...@gmail.com 
 wrote:
 Also consider creating pairs and using *byKey* operators; the key will then be 
 the structure used to consolidate or deduplicate your data.
 my2c
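
A sketch of that *byKey* idea (the Event type and its id field are hypothetical):

    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream

    case class Event(id: String, payload: String)

    // Key each record by an identifier and keep one record per key within each batch.
    def deduplicate(events: DStream[Event]): DStream[Event] =
      events.map(e => (e.id, e))        // create pairs; the key drives the consolidation
            .reduceByKey((a, _) => a)   // one record per key per batch
            .map(_._2)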
 
 
 One thing I wonder: imagine I want to sub-divide the RDDs in a DStream into 
 several RDDs, but not according to a time window; I don't see any trivial way 
 to do it...
  
 
 
 On Thu, Mar 20, 2014 at 11:50 AM, Pascal Voitot Dev 
 pascal.voitot@gmail.com wrote:
 Actually it's quite simple...
 
 DStream[T] is a stream of RDD[T].
 So applying count on DStream is just applying count on each RDD of this 
 DStream.
 So at the end of count, you have a DStream[Long] containing the same number of 
 RDDs as before, but each RDD contains just one element: the count result 
 for the corresponding original RDD.
 
 For reduce, it's the same using reduce operation...
 
 The only operations that are a bit more complex are reduceByWindow & 
 countByValueAndWindow, which union RDDs over the time window...
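
A small sketch of the windowed case, where each windowed RDD is the union of the last few batches (durations and the input directory are invented):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("WindowSketch").setMaster("local[2]"), Seconds(20))
    val lines = ssc.textFileStream("hdfs:///tmp/stream-in")

    // Each windowed RDD covers the last 60 seconds, sliding every 20 seconds;
    // count() then emits one single-element RDD per window.
    val windowedCount = lines.window(Seconds(60), Seconds(20)).count()
    windowedCount.print()

    ssc.start()
    ssc.awaitTermination()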
 
 On Thu, Mar 20, 2014 at 9:51 AM, Sanjay Awatramani sanjay_a...@yahoo.com 
 wrote:
 @TD: I do not need multiple RDDs in a DStream in every batch. On the contrary, 
 my logic would work fine if there is only 1 RDD. But then the description for 
 functions like reduce & count (Return a new DStream of single-element RDDs by 
 counting the number of elements in each RDD of the source DStream.) left me 
 confused about whether I should account for the fact that a DStream can have 
 multiple RDDs. My streaming code processes a batch every hour. In the 2nd 
 batch, I checked that the DStream contains only 1 RDD, i.e. the 2nd batch's 
 RDD. I verified this using sysout in foreachRDD. Does that mean that the 
 DStream will always contain only 1 RDD?
 
 A DStream creates an RDD for each window corresponding to your batch duration 
 (maybe if there is no data in the current time window, no RDD is created, but 
 I'm not sure about that). 
 So no, there is not one single RDD in a DStream; it just depends on the batch 
 duration and the collected data.
 
  
 Is there a way to access the RDD of the 1st batch in the 2nd batch? The 1st 
 batch may contain some records which were not relevant to the first batch and 
 are to be processed in the 2nd batch. I know I can use the sliding window 
 mechanism of streaming, but if I'm not using it and there is no way to access 
 the previous batch's RDD, then it means that functions like count will always 
 return a DStream containing only 1 RDD,