Re: Parquet-SPARK-PIG integration.

2014-04-26 Thread suman bharadwaj
Figured out how to do it, so I thought of sharing in case someone is
interested.

import parquet.column.ColumnReader
import parquet.filter.ColumnRecordFilter._
import parquet.filter.ColumnPredicates._
import parquet.hadoop.{ParquetOutputFormat, ParquetInputFormat}
import org.apache.hadoop.mapred.JobConf
import parquet.bytes.BytesInput
import parquet.pig.TupleReadSupport;
import org.apache.pig.data.Tuple;
val conf = new JobConf()
conf.set("parquet.pig.schema", "id:int,name:chararray")
ParquetInputFormat.setReadSupportClass(conf, classOf[TupleReadSupport])
val file = sc.newAPIHadoopFile("path/part-m-0.parquet",
    classOf[ParquetInputFormat[Tuple]], classOf[Void], classOf[Tuple], conf)
  .map(x => (x._2.get(0), x._2.get(1)))
  .collect()
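For reference, Spark SQL (alpha as of Spark 1.0) can also read Parquet files directly, without wiring up the Pig read support. This is a minimal sketch, assuming a Spark build that includes the spark-sql module and an existing SparkContext `sc`:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Spark SQL reads the schema from the Parquet footer itself,
// so no parquet.pig.schema property is needed.
val rows = sqlContext.parquetFile("path/part-m-0.parquet")

rows.registerAsTable("records")
sqlContext.sql("SELECT id, name FROM records").collect()
```

The trade-off is that you get Row objects with the Parquet schema rather than Pig Tuples, which may or may not suit downstream code.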

Regards,
SB


On Sat, Apr 26, 2014 at 3:31 PM, suman bharadwaj suman@gmail.com wrote:

 Hi All,

 We have written PIG Jobs which outputs the data in parquet format.

 For eg:

 register parquet-column-1.3.1.jar;
 register parquet-common-1.3.1.jar;
 register parquet-format-2.0.0.jar;
 register parquet-hadoop-1.3.1.jar;
 register parquet-pig-1.3.1.jar;
 register parquet-encoding-1.3.1.jar;

 A = load 'path' using PigStorage('\t') as (id:int,name:chararray);
 store A into 'output_path' using parquet.pig.ParquetStorer();

 Now how do I read this Parquet file in Spark?

 Thanks in advance.

 Regards,
 SB



Re: Pig on Spark

2014-04-25 Thread suman bharadwaj
Hey Mayur,

We use HiveColumnarLoader and XMLLoader. Are these working as well?

Will try few things regarding porting Java MR.

Regards,
Suman Bharadwaj S


On Thu, Apr 24, 2014 at 3:09 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 Right now UDFs are not working. It's at the top of the list though; you should
 be able to use them soon :)
 Are there any other functionalities of Pig you use often apart from the usual
 suspects?

 Existing Java MR jobs would be an easier move. Are these cascading jobs or
 single map-reduce jobs? If single, then you should be able to write a
 Scala wrapper to call the map & reduce functions with some magic &
 let your core code be. Would be interesting to see an actual example & get
 it to work.
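The wrapper idea above can be sketched roughly as follows. This is a hypothetical illustration (the word-count logic and all names are made up, not from this thread), assuming the per-record logic can be factored out of an existing Mapper/Reducer into plain functions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MrWrapper {
  // Core logic lifted out of a hypothetical Mapper#map(key, value, context):
  // tokenize a line and emit (word, 1) pairs.
  def mapLogic(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1))

  // Core logic lifted out of a hypothetical Reducer#reduce(key, values, context):
  // sum the counts for one key.
  def reduceLogic(a: Int, b: Int): Int = a + b

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("mr-wrapper").setMaster("local[2]"))
    val counts = sc.textFile(args(0))
      .flatMap(mapLogic)          // plays the role of the map phase
      .reduceByKey(reduceLogic)   // plays the role of shuffle + reduce
    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}
```

The point is that only the Hadoop `Context`/`OutputCollector` plumbing is replaced; the core per-record code stays as-is.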

 Regards
 Mayur


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Thu, Apr 24, 2014 at 2:46 AM, suman bharadwaj suman@gmail.com wrote:

 We are currently in the process of converting Pig and Java map-reduce
 jobs to Spark jobs, and we have written a couple of Pig UDFs as well. Hence I
 was checking if we can leverage Spork without converting to Spark jobs.

 And is there any way I can port my existing Java MR jobs to Spark?
 I know this thread has a different subject; let me know if I need to ask
 this question in a separate thread.

 Thanks in advance.


 On Thu, Apr 24, 2014 at 2:13 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 UDF,
 Generate,
  and many, many more are not working :)

 Several of them work: joins, filters, group by, etc.
 I am translating the ones we need; would be happy to get help on others.
 Will host a JIRA to track them if you are interested.


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj suman@gmail.com wrote:

 Are all the features available in Pig working in Spork? For example, UDFs?

 Thanks.


 On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 There are two benefits I get as of now:
 1. Most of the time, a lot of customers don't want the full power; they
 want something dead simple with which they can do a DSL. They end up
 using Hive for a lot of ETL just because it's SQL & they understand it. Pig is
 close & wraps a lot of framework-level semantics away from the user &
 lets him focus on data flow.
 2. Some have codebases in Pig already & are just looking to do it
 faster. I am yet to benchmark that on Pig on Spark.

 I agree that Pig on Spark cannot solve a lot of problems, but it can solve
 some without forcing the end customer to do anything even close to coding;
 I believe there is quite some value in making Spark accessible to a larger
 audience.
 End of the day, to each his own :)

 Regards
 Mayur


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi 
 mundlap...@gmail.com wrote:

 This seems like an interesting question.

 I love Apache Pig. It is so natural and the language flows with nice
 syntax.

 While I was at Yahoo! in core Hadoop Engineering, I used Pig a
 lot for analytics and provided feedback to the Pig team to add much more
 functionality when it was at version 0.7. Lots of new functionality is
 offered now.
 End of the day, Pig is a DSL for data flows. There will always be
 gaps and enhancements. I often thought: is a DSL the right way to solve data
 flow problems? Maybe not; we need complete language constructs. We may have
 found the answer - Scala. With Scala's dynamic compilation, we can write
 much more powerful constructs than any DSL can provide.

 If I am a new organization and beginning to choose, I would go with
 Scala.

 Here is the example:

 #!/bin/sh
 exec scala "$0" "$@"
 !#
 YOUR DSL GOES HERE BUT IN SCALA!

 You have DSL-like scripting, functional style, and complete language power!
 If we can improve the first 3 lines, there you go: you have the most powerful
 DSL to solve data problems.

 -Bharath





 On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Sameer,

 Lin (cc'ed) could also give you some updates about Pig on Spark
 development on her side.

 Best,
 Xiangrui

 On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak ssti...@live.com
 wrote:
  Hi Mayur,
  We are planning to upgrade our distribution from MR1 to MR2 (YARN), and the
  goal is to get Spork set up next month. I will keep you posted. Can you
  please keep me informed about your progress as well.
 
  
  From: mayur.rust...@gmail.com
  Date: Mon, 10 Mar 2014 11:47:56 -0700
 
  Subject: Re: Pig on Spark
  To: user@spark.apache.org
 
 
  Hi Sameer,
  Did you make any progress on this? My team is also trying it out and would
  love to know some details of your progress.
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi
 
 
 
  On Thu, Mar 6, 2014 at 2:20 PM

Re: Pig on Spark

2014-04-23 Thread suman bharadwaj
Are all the features available in Pig working in Spork? For example, UDFs?

Thanks.


  On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak ssti...@live.com wrote:
 
  Hi Aniket,
  Many thanks! I will check this out.
 
  
  Date: Thu, 6 Mar 2014 13:46:50 -0800
  Subject: Re: Pig on Spark
  From: aniket...@gmail.com
  To: user@spark.apache.org; tgraves...@yahoo.com
 
 
  There is some work to make this work on yarn at
  https://github.com/aniket486/pig. (So, compile pig with ant
  -Dhadoopversion=23)
 
  You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
  find out what sort of env variables you need (sorry, I haven't been able to
  clean this up - it's in progress). There are a few known issues with this; I
  will work on fixing them soon.
 
  Known issues:
  1. Limit does not work (spork-fix)
  2. Foreach requires turning off schema-tuple-backend (should be a pig-jira)
  3. Algebraic UDFs don't work (spork-fix in progress)
  4. Group by rework (to avoid OOMs)
  5. UDF classloader issue (requires SPARK-1053; then you can put
  pig-withouthadoop.jar as SPARK_JARS in SparkContext along with UDF jars)
 
  ~Aniket
 
 
 
 
  On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com wrote:
 
  I had asked a similar question on the dev mailing list a while back
 (Jan
  22nd).
 
  See the archives:
  http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser -
  look for spork.
 
  Basically Matei said:
 
  Yup, that was it, though I believe people at Twitter picked it up again
  recently. I'd suggest
  asking Dmitriy if you know him. I've seen interest in this from several
  other groups, and
  if there's enough of it, maybe we can start another open source repo to
  track it. The work
  in that repo you pointed to was done over one week, and already had
 most of
  Pig's operators
  working. (I helped out with this prototype over Twitter's hack week.)
 That
  work also calls
  the Scala API directly, because it was done before we 

Re: Pig on Spark

2014-04-23 Thread suman bharadwaj
We are currently in the process of converting Pig and Java map-reduce jobs
to Spark jobs, and we have written a couple of Pig UDFs as well. Hence I was
checking if we can leverage Spork without converting to Spark jobs.

And is there any way I can port my existing Java MR jobs to Spark?
I know this thread has a different subject; let me know if I need to ask this
question in a separate thread.

Thanks in advance.



Re: PIG to SPARK

2014-03-06 Thread suman bharadwaj
Thanks Mayur. I don't have a clear idea of how pipe works and wanted to
understand more about it. When do we use pipe() and how does it work? Can
you please share some sample code if you have any (even pseudo-code is fine)?
It would really help.

Regards,
Suman Bharadwaj S


On Thu, Mar 6, 2014 at 3:46 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 The real question is why you want to run a Pig script using Spark.
 Are you planning to use Spark as the underlying processing engine for Pig?
 That's not simple.
 Are you planning to feed Pig data to Spark for further processing? Then
 you can write it to HDFS & trigger your Spark script.

 rdd.pipe is basically similar to Hadoop streaming, allowing you to run a
 script on each partition of the RDD & get the output as another RDD.
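A minimal sketch of rdd.pipe in action - hypothetical, assuming a POSIX shell with awk on every worker: each partition's elements are written to the command's stdin, one per line, and each line the command prints becomes an element of the result RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pipe-demo").setMaster("local[2]"))

    val nums = sc.parallelize(Seq("1", "2", "3", "4"), numSlices = 2)

    // One awk process is launched per partition; it reads that partition's
    // elements from stdin, and its stdout lines form the new RDD.
    val doubled = nums.pipe("awk '{ print $1 * 2 }'")

    doubled.collect().foreach(println)  // note: elements come back as strings
    sc.stop()
  }
}
```

A whole Pig script can be driven the same way only if it reads stdin and writes stdout per partition, which Pig normally does not; for Pig output it is usually simpler to write to HDFS and read that from Spark, as Mayur suggests.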
 Regards
 Mayur


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, Mar 5, 2014 at 10:29 AM, suman bharadwaj suman@gmail.com wrote:

 Hi,

 How can I call a Pig script using Spark? Can I use rdd.pipe() here?

 And can anyone share a sample implementation of rdd.pipe()? If you can
 explain how rdd.pipe() works, it would really really help.

 Regards,
 SB