Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Fengdong Yu
I don't think there is a performance difference between the 1.x API and the 2.x API.

But it's not a big issue for your change: only
com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java
needs to change, right?

Moving to the 2.x API is not a big change. If you agree, I can do it, but I cannot promise to
finish within one or two weeks because of my daily job.
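
For reference, a minimal sketch (not the actual spark-xml or Spark source, just an illustration) of how code can work against either Hadoop version now that TaskAttemptContext changed from a class (1.x) to an interface (2.x): instantiate the 2.x TaskAttemptContextImpl reflectively and fall back to the 1.x class.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{TaskAttemptContext, TaskAttemptID}

def newTaskAttemptContext(conf: Configuration, id: TaskAttemptID): TaskAttemptContext = {
  val clazz =
    try Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl") // Hadoop 2.x
    catch {
      case _: ClassNotFoundException =>
        Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext")          // Hadoop 1.x class
    }
  // Both versions expose a (Configuration, TaskAttemptID) constructor, so reflection works for either.
  clazz.getConstructor(classOf[Configuration], classOf[TaskAttemptID])
    .newInstance(conf, id)
    .asInstanceOf[TaskAttemptContext]
}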





> On Dec 9, 2015, at 5:01 PM, Hyukjin Kwon  wrote:
> 
> Hi all, 
> 
> I am writing this email to both user-group and dev-group since this is 
> applicable to both.
> 
> I am now working on the Spark XML datasource
> (https://github.com/databricks/spark-xml).
> It uses an InputFormat implementation, which I downgraded to the Hadoop 1.x API for
> version compatibility.
> 
> However, I found that the internal JSON datasource and others at Databricks use the
> Hadoop 2.x API, instantiating TaskAttemptContextImpl by reflection, because
> TaskAttemptContext is a class in Hadoop 1.x and an interface in Hadoop 2.x.
> 
> So I looked through the code for advantages of the Hadoop 2.x API, but I couldn't find
> any. I wonder if there are real advantages to using the Hadoop 2.x API.
> 
> I understand that it is still preferable to use the Hadoop 2.x APIs, at least to be ready
> for future differences, but I feel it might not be necessary to move to Hadoop 2.x if that
> means reflecting a method.
> 
> I would appreciate it if you could leave a comment at
> https://github.com/databricks/spark-xml/pull/14 as well as send back a
> reply if there is a good explanation.
> 
> Thanks! 



Re: getting error while persisting in hive

2015-12-09 Thread Fengdong Yu
Use .write, not .write(). df.write returns a DataFrameWriter; writing .write() is interpreted as
calling apply() on it, which is exactly the "does not take parameters" error you see.
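
Applying that fix to the line from the question (a sketch; it assumes import org.apache.spark.sql.SaveMode and keeps the spark-csv data source from the original code):

df.select("name", "age")
  .write
  .format("com.databricks.spark.csv")
  .mode(SaveMode.Append)
  .saveAsTable("PersonHiveTable")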




> On Dec 9, 2015, at 5:37 PM, Divya Gehlot  wrote:
> 
> Hi,
> I am using spark 1.4.1 .
> I am getting an error when persisting Spark DataFrame output to Hive:
> scala> 
> df.select("name","age").write().format("com.databricks.spark.csv").mode(SaveMode.Append).saveAsTable("PersonHiveTable");
> :39: error: org.apache.spark.sql.DataFrameWriter does not take 
> parameters
>  
> 
> Can somebody point out what is wrong here?
> 
> Would really appreciate your help.
> 
> Thanks in advance 
> 
> Divya   



Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Fengdong Yu
val req_logs_with_dpid = req_logs.filter(req_logs("req_info.dpid") !== "")

Azuryy Yu
Sr. Infrastructure Engineer

cell: 158-0164-9103
wechat: azuryy
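
To keep records where either column is non-null and non-empty, a sketch assuming the Spark 1.4/1.5 Column API (!== was later renamed =!=):

val req_logs_with_dpid = req_logs.filter(
  (req_logs("req_info.dpid").isNotNull && (req_logs("req_info.dpid") !== "")) ||
  (req_logs("req_info.dpid_sha1").isNotNull && (req_logs("req_info.dpid_sha1") !== ""))
)

The same condition can also be written as a SQL string, e.g. req_log.filter("req_info.dpid is not null and req_info.dpid != ''").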


On Wed, Dec 9, 2015 at 7:43 PM, Prashant Bhardwaj <
prashant2006s...@gmail.com> wrote:

> Hi
>
> I have two columns in my JSON which can have null, empty, or non-empty
> strings as values.
> I know how to filter records which have non-null value using following:
>
> val req_logs = sqlContext.read.json(filePath)
>
> val req_logs_with_dpid = req_log.filter("req_info.dpid is not null or
> req_info.dpid_sha1 is not null")
>
> But how do I filter when the value of a column is an empty string?
> --
> Regards
> Prashant
>


Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu

what’s your data format? ORC or CSV or others?

val keys = sqlContext.read.orc("your previous batch data path").select($"uniq_key").collect()
val broadcastKeys = sc.broadcast(keys.map(_.getString(0)).toSet)

val rdd = your_current_batch_data
rdd.filter(line => !broadcastKeys.value.contains(line.key))






> On Dec 8, 2015, at 4:44 PM, Ramkumar V <ramkumar.c...@gmail.com> wrote:
> 
> I'm running a Spark batch job in cluster mode every hour and it runs for 15
> minutes. I have certain unique keys in the dataset. I don't want to process
> those keys again during the next hourly batch.
> 
> Thanks,
> 
>  <https://in.linkedin.com/in/ramkumarcs31> 
> 
> 
> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu <fengdo...@everstring.com 
> <mailto:fengdo...@everstring.com>> wrote:
> Can you give more detail on your question? What do your previous batch and the
> current batch look like?
> 
> 
> 
> 
> 
>> On Dec 8, 2015, at 3:52 PM, Ramkumar V <ramkumar.c...@gmail.com 
>> <mailto:ramkumar.c...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> I'm running Java on Spark in cluster mode. I want to apply a filter on a
>> JavaRDD based on some values from a previous batch. If I store those values in
>> MapDB, is it possible to apply the filter during the current batch?
>> 
>> Thanks,
>> 
>>  <https://in.linkedin.com/in/ramkumarcs31> 
>> 
> 
> 



Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
Can you give more detail on your question? What do your previous batch and the
current batch look like?





> On Dec 8, 2015, at 3:52 PM, Ramkumar V  wrote:
> 
> Hi,
> 
> I'm running Java on Spark in cluster mode. I want to apply a filter on a
> JavaRDD based on some values from a previous batch. If I store those values in
> MapDB, is it possible to apply the filter during the current batch?
> 
> Thanks,
> 
>   
> 



Re: About Spark On Hbase

2015-12-08 Thread Fengdong Yu
https://github.com/nerdammer/spark-hbase-connector

This one is better and easier to use.
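
A minimal write sketch with that connector, going from its README at the time (treat the exact method names, the spark.hbase.host setting, and the table/column names here as assumptions and check the project's documentation):

import it.nerdammer.spark.hbase._

// requires sparkConf.set("spark.hbase.host", "<zookeeper quorum>") when building the context
// rdd: RDD[(String, Int, String)] = (row key, value for col1, value for col2)
rdd.toHBaseTable("my_table")
   .toColumns("col1", "col2")
   .inColumnFamily("cf")
   .save()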





> On Dec 9, 2015, at 3:04 PM, censj  wrote:
> 
> Hi all,
> I'm now using Spark, but I haven't found an open source project for working with
> HBase from Spark. Can anyone point me to one?
>  



Re: NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-07 Thread Fengdong Yu
Can you try like this in your sbt:


val spark_version = "1.5.2"
val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact 
= "servlet-api")
val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")

libraryDependencies ++= Seq(
  "org.apache.spark" %%  "spark-sql"  % spark_version % "provided” 
excludeAll(excludeServletApi, excludeEclipseJetty),
  "org.apache.spark" %%  "spark-hive" % spark_version % "provided" 
excludeAll(excludeServletApi, excludeEclipseJetty)
)
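
If the exclusions alone don't resolve it, another sbt option (a sketch, assuming sbt 0.13; the version shown is only an example) is to force a single jackson version across all transitive dependencies:

dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-databind"    % "2.6.3",
  "com.fasterxml.jackson.core" % "jackson-core"        % "2.6.3",
  "com.fasterxml.jackson.core" % "jackson-annotations" % "2.6.3"
)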



> On Dec 8, 2015, at 2:26 PM, Sunil Tripathy  wrote:
> 
> I am getting the following exception when I use spark-submit to submit a 
> spark streaming job.
> 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;
> at 
> com.amazonaws.internal.config.InternalConfig.(InternalConfig.java:43)
> 
> I tried different versions of the jackson libraries but that does not seem
> to help.
>  libraryDependencies += "com.fasterxml.jackson.core" % "jackson-databind" % 
> "2.6.3"
> libraryDependencies += "com.fasterxml.jackson.core" % "jackson-core" % "2.6.3"
> libraryDependencies += "com.fasterxml.jackson.core" % "jackson-annotations" % 
> "2.6.3"
> 
> Any pointers to resolve the issue?
> 
> Thanks



Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Fengdong Yu
If your RDD is in JSON format, that's easy:

val df = sqlContext.read.json(rdd)
df.saveAsTable("your_table_name")
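
The equivalent with the DataFrameWriter API (a sketch, assuming Spark 1.4+; the table name is a placeholder):

import org.apache.spark.sql.SaveMode

val df = sqlContext.read.json(rdd)
df.write.mode(SaveMode.Append).saveAsTable("your_table_name")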



> On Dec 7, 2015, at 5:28 PM, Divya Gehlot  wrote:
> 
> Hi,
> I am a newbie to Spark.
> Could somebody guide me on how to persist my Spark RDD results in Hive using the
> saveAsTable API?
> I would appreciate it if you could provide an example for a Hive external table.
> 
> Thanks in advance.
> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Fengdong Yu
I suppose your output data is ORC and you want to save it to the Hive database test,
with external table name testTable:



import scala.collection.immutable

sqlContext.createExternalTable("test.testTable",
  "org.apache.spark.sql.hive.orc", Map("path" -> "/data/test/mydata"))


> On Dec 7, 2015, at 5:28 PM, Divya Gehlot  wrote:
> 
> Hi,
> I am a newbie to Spark.
> Could somebody guide me on how to persist my Spark RDD results in Hive using the
> saveAsTable API?
> I would appreciate it if you could provide an example for a Hive external table.
> 
> Thanks in advance.
> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: python rdd.partionBy(): any examples of a custom partitioner?

2015-12-07 Thread Fengdong Yu
Refer to Example 4-27 ("Python custom partitioner") here:
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html




> On Dec 8, 2015, at 10:07 AM, Keith Freeman <8fo...@gmail.com> wrote:
> 
> I'm not a python expert, so I'm wondering if anybody has a working example of 
> a partitioner for the "partitionFunc" argument (default "portable_hash") to 
> rdd.partitionBy()?
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 



Re: Intersection of two sets by key - join vs filter + join

2015-12-06 Thread Fengdong Yu
Don't do the join first.

Collect and broadcast your small RDD:

val bc = sc.broadcast(small_rdd.collectAsMap())

Then filter the large RDD against the broadcast map before the expensive steps:

large_rdd.filter { case (k, _) => bc.value.contains(k) }
  .map { case (k, v) => (k, (bc.value(k), v)) }   // attach the small-side fields
  .distinct()
  .groupByKey()






> On Dec 7, 2015, at 1:41 PM, Z Z  wrote:
> 
> I have two RDDs, one really large in size and the other much smaller. I'd like to
> find all unique tuples in the large RDD with keys from the small RDD. There are
> duplicate tuples as well, and I only care about the distinct tuples.
> 
> For example
> large_rdd = sc.parallelize([('abcdefghij'[i%10], i) for i in range(100)] * 5)
> small_rdd = sc.parallelize([('zab'[i%3], i) for i in range(10)])
> expected_rdd = [('a', [1, 4, 7, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90]), 
> ('b',   [2, 5, 8, 1, 11, 21, 31, 41, 51, 61, 71, 81, 91])]
> 
> There are two expensive operations in my solution - join and distinct. Both I 
> assume cause a full shuffle and leave the child RDD hash partitioned. Given 
> that, is the following the best I can do ?
> 
> keys = small_rdd.keys().collect()
> filtered_unique_large_rdd = large_rdd.filter(lambda (k,v):k in 
> keys).distinct().groupByKey()
> filtered_unique_large_rdd.join(small_rdd.groupByKey()).mapValues(lambda x: 
> sum([list(i) for i in x], [])).collect()
> 
> Basically, I filter the tuples explicitly, pick the distinct ones and then join with
> the smaller_rdd. I hope that the distinct operation will leave the keys hash
> partitioned and will not cause another shuffle during the subsequent join.
> 
> Thanks in advance for any suggestions/ideas.



Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
Yes, it results in a shuffle.



> On Dec 4, 2015, at 6:04 PM, Stephen Boesch <java...@gmail.com> wrote:
> 
> @Yu Fengdong:  Your approach - specifically the groupBy results in a shuffle 
> does it not?
> 
> 2015-12-04 2:02 GMT-08:00 Fengdong Yu <fengdo...@everstring.com 
> <mailto:fengdo...@everstring.com>>:
> There are many ways, one simple is:
> 
> such as: you want to know how many rows for each month:
> 
> sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count
> 
> 
> the output looks like:
> 
> month     count
> 201411    100
> 201412    200
> 
> 
> Hope this helps.
> 
> 
> 
> > On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas <johngou...@gmail.com 
> > <mailto:johngou...@gmail.com>> wrote:
> >
> > Hi there,
> >
> > I have my data stored in HDFS partitioned by month in Parquet format.
> > The directory looks like this:
> >
> > -month=201411
> > -month=201412
> > -month=201501
> > -
> >
> > I want to compute some aggregates for every timestamp.
> > How is it possible to achieve that by taking advantage of the existing 
> > partitioning?
> > One naive way I am thinking is issuing multiple sql queries:
> >
> > SELECT * FROM TABLE WHERE month=201411
> > SELECT * FROM TABLE WHERE month=201412
> > SELECT * FROM TABLE WHERE month=201501
> > .
> >
> > computing the aggregates on the results of each query and combining them in 
> > the end.
> >
> > I think there should be a better way right?
> >
> > Thanks
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 



Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
There are many ways, one simple is:

such as: you want to know how many rows for each month:

sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count


the output looks like:

month     count
201411    100
201412    200


Hope this helps.
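
If you only need the aggregate for certain months, filtering on the partition column lets Spark prune the other month=... directories instead of scanning everything. A sketch (the path is a placeholder and it assumes import sqlContext.implicits._):

val df = sqlContext.read.parquet("/path/to/table")   // discovers the month=... partitions
df.filter($"month" === 201411)                       // only files under month=201411 are read
  .groupBy($"month")
  .count()
  .show()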



> On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas  wrote:
> 
> Hi there,
> 
> I have my data stored in HDFS partitioned by month in Parquet format.
> The directory looks like this:
> 
> -month=201411
> -month=201412
> -month=201501
> -
> 
> I want to compute some aggregates for every timestamp.
> How is it possible to achieve that by taking advantage of the existing 
> partitioning?
> One naive way I am thinking is issuing multiple sql queries:
> 
> SELECT * FROM TABLE WHERE month=201411
> SELECT * FROM TABLE WHERE month=201412
> SELECT * FROM TABLE WHERE month=201501
> .
> 
> computing the aggregates on the results of each query and combining them in 
> the end.
> 
> I think there should be a better way right?
> 
> Thanks


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Low Latency SQL query

2015-12-01 Thread Fengdong Yu
It depends on several things:

1) What is your data format? CSV (text) or ORC/Parquet?
2) Do you have a data warehouse layer to summarize/cluster your data?

If your data is plain text, or you query the raw data directly, it will be slow; Spark
cannot do much to optimize such a job.




> On Dec 2, 2015, at 9:21 AM, Andrés Ivaldi  wrote:
> 
> Mark, we have an application that uses data from different kinds of sources, and we
> built an engine able to handle that, but it can't scale to big data (we could make it,
> but that is too time-expensive) and it doesn't have a machine learning module, etc. We
> came across Spark and it looks like it has everything we need (actually it does), but our
> current latency is very low, and when we did some testing it took too long for the same
> kind of results, always compared against the RDBMS which is our primary source.
> 
> So we want to expand our sources to CSV, web services, big data, etc. We can either
> extend our engine or use something like Spark, which gives us the power of clustering,
> access to different kinds of sources, streaming, machine learning, easy extensibility,
> and so on.
> 
> On Tue, Dec 1, 2015 at 9:36 PM, Mark Hamstra wrote:
> I'd ask another question first: If your SQL query can be executed in a 
> performant fashion against a conventional (RDBMS?) database, why are you 
> trying to use Spark?  How you answer that question will be the key to 
> deciding among the engineering design tradeoffs to effectively use Spark or 
> some other solution.
> 
> On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi wrote:
> OK, so the latency problem arises because I'm using SQL as the source? What about
> CSV, Hive, or another source?
> 
> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra wrote:
> It is not designed for interactive queries.
> 
> You might want to ask the designers of Spark, Spark SQL, and particularly 
> some things built on top of Spark (such as BlinkDB) about their intent with 
> regard to interactive queries.  Interactive queries are not the only designed 
> use of Spark, but it is going too far to claim that Spark is not designed at 
> all to handle interactive queries.
> 
> That being said, I think that you are correct to question the wisdom of 
> expecting lowest-latency query response from Spark using SQL (sic, presumably 
> a RDBMS is intended) as the datastore.
> 
> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke wrote:
> Hmm it will never be faster than SQL if you use SQL as an underlying storage. 
> Spark is (currently) an in-memory batch engine for iterative machine learning 
> workloads. It is not designed for interactive queries. 
> Currently hive is going into the direction of interactive queries. 
> Alternatives are Hbase on Phoenix or Impala.
> 
> On 01 Dec 2015, at 21:58, Andrés Ivaldi wrote:
> 
>> Yes.
>> The use case would be: have Spark in a service (I didn't investigate this yet); through
>> API calls to this service we perform some aggregations over data in SQL. We are already
>> doing this with an internal development.
>> 
>> Nothing complicated, for instance, a table with Product, Product Family, 
>> cost, price, etc. Columns like Dimension and Measures,
>> 
>> I want to use Spark to query that table and perform a kind of rollup, with cost
>> as the measure and Product, Product Family as the dimensions.
>> 
>> With only 3 columns, it takes about 20s to perform that query and the aggregation;
>> the same query run directly against the database with a GROUP BY on those columns
>> takes about 1s.
>> 
>> regards
>> 
>> 
>> 
>> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke wrote:
>> can you elaborate more on the use case?
>> 
>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi wrote:
>> >
>> > Hi,
>> >
>> > I'd like to use Spark to perform some transformations over data stored in SQL,
>> > but I need low latency. In my tests, Spark context creation and querying the data
>> > over SQL take too long.
>> >
>> > Any idea for speed up the process?
>> >
>> > regards.
>> >
>> > --
>> > Ing. Ivaldi Andres
>> 
>> 
>> 
>> -- 
>> Ing. Ivaldi Andres
> 
> 
> 
> 
> -- 
> Ing. Ivaldi Andres
> 
> 
> 
> 
> -- 
> Ing. Ivaldi Andres



Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Fengdong Yu
Hi,
you can try the following.

If your table is at location "/test/table/" on HDFS
and has partitions:

 "/test/table/date=2012"
 "/test/table/date=2013"

df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table")
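
If you only need to append into one specific (static) partition, a hedged alternative is to write directly under that partition's directory; whether the new partition still has to be registered in the metastore (ALTER TABLE ... ADD PARTITION) depends on how the table was created. A sketch, assuming import sqlContext.implicits._ and org.apache.spark.sql.SaveMode:

df.filter($"date" === 2012)
  .drop("date")                    // the partition value is encoded in the path, not the data
  .write.mode(SaveMode.Append)
  .save("/test/table/date=2012")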



> On Dec 2, 2015, at 10:50 AM, Isabelle Phan  wrote:
> 
> df.write.partitionBy("date").insertInto("my_table")



Re: load multiple directory using dataframe load

2015-11-23 Thread Fengdong Yu
hiveContext.read.format("orc").load("mypath/*")



> On Nov 24, 2015, at 1:07 PM, Renu Yadav  wrote:
> 
> Hi ,
> 
> I am using DataFrames and want to load ORC files from multiple directories,
> like this:
> hiveContext.read.format.load("mypath/3660,myPath/3661")
> 
> but it is not working.
> 
> Please suggest how to achieve this
> 
> Thanks & Regards,
> Renu Yadav


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Dataframe constructor

2015-11-23 Thread Fengdong Yu
It is as simple as:

val df = sqlContext.sql("select * from table")

or

val df = sqlContext.read.json("hdfs_path")




> On Nov 24, 2015, at 3:09 AM, spark_user_2015  wrote:
> 
> Dear all,
> 
> is the following usage of the Dataframe constructor correct or does it
> trigger any side effects that I should be aware of?
> 
> My goal is to keep track of my dataframe's state and allow custom
> transformations accordingly.
> 
>  val df: Dataframe = ...some dataframe...
>  val newDf = new DF(df.sqlContext, df.queryExecution.logical) with StateA 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-constructor-tp25455.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to adjust Spark shell table width

2015-11-21 Thread Fengdong Yu
Hi,

I found that if a column value is too long, the Spark shell only shows a partial result.

such as:

sqlContext.sql("select url from tableA”).show(10)

It cannot show the whole URL. How can I adjust this? Thanks.
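
One possible knob, assuming Spark 1.5+ where show takes a second truncate flag:

sqlContext.sql("select url from tableA").show(10, false)   // false = do not truncate long cells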






-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Fengdong Yu
The simplest way is to remove all "provided" scopes in your pom and then build a single
assembly (fat) jar, for example with the maven-shade/assembly plugin (or 'sbt assembly' if
you build with sbt). Then get rid of '--jars', because the assembly already includes all
dependencies.






> On Nov 18, 2015, at 2:15 PM, Jack Yang  wrote:
> 
> So weird. Is there anything wrong with the way I made the pom file (I 
> labelled them as provided)?
>  
> Is there a missing jar I forgot to add with "--jars"?
>  
> See the trace below:
>  
>  
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> breeze/storage/DefaultArrayValue
> at smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: breeze.storage.DefaultArrayValue
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 10 more
> 15/11/18 17:15:15 INFO util.Utils: Shutdown hook called
>  
>  
> From: Ted Yu [mailto:yuzhih...@gmail.com] 
> Sent: Wednesday, 18 November 2015 4:01 PM
> To: Jack Yang
> Cc: user@spark.apache.org
> Subject: Re: spark with breeze error of NoClassDefFoundError
>  
> Looking in local maven repo, breeze_2.10-0.7.jar contains DefaultArrayValue :
>  
> jar tvf 
> /Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar | 
> grep !$
> jar tvf 
> /Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar | 
> grep DefaultArrayValue
>369 Wed Mar 19 11:18:32 PDT 2014 
> breeze/storage/DefaultArrayValue$mcZ$sp$class.class
>309 Wed Mar 19 11:18:32 PDT 2014 
> breeze/storage/DefaultArrayValue$mcJ$sp.class
>   2233 Wed Mar 19 11:18:32 PDT 2014 
> breeze/storage/DefaultArrayValue$DoubleDefaultArrayValue$.class
>  
> Can you show the complete stack trace ?
>  
> FYI
>  
> On Tue, Nov 17, 2015 at 8:33 PM, Jack Yang wrote:
> Hi all,
> I am using spark 1.4.0, and building my codes using maven.
> So in one of my scala, I used:
>  
> import breeze.linalg._
> val v1 = new breeze.linalg.SparseVector(commonVector.indices, 
> commonVector.values, commonVector.size)   
> val v2 = new breeze.linalg.SparseVector(commonVector2.indices, 
> commonVector2.values, commonVector2.size)
> println (v1.dot(v2) / (norm(v1) * norm(v2)) )
>  
>  
>  
> in my pom.xml file, I used:
> 
> <dependency>
>   <groupId>org.scalanlp</groupId>
>   <artifactId>breeze-math_2.10</artifactId>
>   <version>0.4</version>
>   <scope>provided</scope>
> </dependency>
> 
> <dependency>
>   <groupId>org.scalanlp</groupId>
>   <artifactId>breeze_2.10</artifactId>
>   <version>0.11.2</version>
>   <scope>provided</scope>
> </dependency>
>  
> When submitting, I included the breeze jars (breeze_2.10-0.11.2.jar
> breeze-math_2.10-0.4.jar breeze-natives_2.10-0.11.2.jar
> breeze-process_2.10-0.3.jar) using the "--jars" argument, although I doubt it is
> necessary to do that.
>  
> however, the error is
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> breeze/storage/DefaultArrayValue
>  
> Any thoughts?
>  
>  
>  
> Best regards,
> Jack



Re: Spark job workflow engine recommendations

2015-11-18 Thread Fengdong Yu
Hi,

We use 'Airflow' as our job workflow scheduler.




> On Nov 19, 2015, at 9:47 AM, Vikram Kone  wrote:
> 
> Hi Nick,
> Quick question about spark-submit command executed from azkaban with command 
> job type.
> I see that when I press kill in azkaban portal on a spark-submit job, it 
> doesn't actually kill the application on spark master and it continues to run 
> even though azkaban thinks that it's killed.
> How do you get around this? Is there a way to kill the spark-submit jobs from 
> azkaban portal?
> 
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath  > wrote:
> Hi Vikram,
> 
> We use Azkaban (2.5.0) in our production workflow scheduling. We just use 
> local mode deployment and it is fairly easy to set up. It is pretty easy to 
> use and has a nice scheduling and logging interface, as well as SLAs (like 
> kill job and notify if it doesn't complete in 3 hours or whatever). 
> 
> However Spark support is not present directly - we run everything with shell 
> scripts and spark-submit. There is a plugin interface where one could create 
> a Spark plugin, but I found it very cumbersome when I did investigate and 
> didn't have the time to work through it to develop that.
> 
> It has some quirks and while there is actually a REST API for adding jos and 
> dynamically scheduling jobs, it is not documented anywhere so you kinda have 
> to figure it out for yourself. But in terms of ease of use I found it way 
> better than Oozie. I haven't tried Chronos, and it seemed quite involved to 
> set up. Haven't tried Luigi either.
> 
> Spark job server is good but as you say lacks some stuff like scheduling and 
> DAG type workflows (independent of spark-defined job flows).
> 
> 
> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke  > wrote:
> Check also falcon in combination with oozie
> 
> Le ven. 7 août 2015 à 17:51, Hien Luu  a écrit :
> Looks like Oozie can satisfy most of your requirements. 
> 
> 
> 
> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone  > wrote:
> Hi,
> I'm looking for open source workflow tools/engines that allow us to schedule 
> spark jobs on a datastax cassandra cluster. Since there are tonnes of 
> alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I wanted to 
> check with people here to see what they are using today.
> 
> Some of the requirements of the workflow engine that I'm looking for are
> 
> 1. First class support for submitting Spark jobs on Cassandra. Not some 
> wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production scale.
> 3. Should be dead easy to write job dependencies using XML or a web interface.
> E.g., job A depends on Job B and Job C, so run Job A after B and C are 
> finished. Don't need to write full blown java applications to specify job 
> parameters and dependencies. Should be very simple to use.
> 4. Time based  recurrent scheduling. Run the spark jobs at a given time every 
> hour or day or week or month.
> 5. Job monitoring, alerting on failures and email notifications on daily 
> basis.
> 
> I have looked at Ooyala's spark job server, which seems to be geared towards 
> making spark jobs run faster by sharing contexts between the jobs, but isn't a 
> full blown workflow engine per se. A combination of spark job server and 
> workflow engine would be ideal 
> 
> Thanks for the inputs
> 
> 
> 



Re: Spark job workflow engine recommendations

2015-11-18 Thread Fengdong Yu
Yes, you can submit jobs remotely.



> On Nov 19, 2015, at 10:10 AM, Vikram Kone <vikramk...@gmail.com> wrote:
> 
> Hi Feng,
> Does airflow allow remote submissions of spark jobs via spark-submit?
> 
> On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu <fengdo...@everstring.com 
> <mailto:fengdo...@everstring.com>> wrote:
> Hi,
> 
> we use ‘Airflow'  as our job workflow scheduler.
> 
> 
> 
> 
>> On Nov 19, 2015, at 9:47 AM, Vikram Kone <vikramk...@gmail.com 
>> <mailto:vikramk...@gmail.com>> wrote:
>> 
>> Hi Nick,
>> Quick question about spark-submit command executed from azkaban with command 
>> job type.
>> I see that when I press kill in azkaban portal on a spark-submit job, it 
>> doesn't actually kill the application on spark master and it continues to 
>> run even though azkaban thinks that it's killed.
>> How do you get around this? Is there a way to kill the spark-submit jobs 
>> from azkaban portal?
>> 
>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <nick.pentre...@gmail.com 
>> <mailto:nick.pentre...@gmail.com>> wrote:
>> Hi Vikram,
>> 
>> We use Azkaban (2.5.0) in our production workflow scheduling. We just use 
>> local mode deployment and it is fairly easy to set up. It is pretty easy to 
>> use and has a nice scheduling and logging interface, as well as SLAs (like 
>> kill job and notify if it doesn't complete in 3 hours or whatever). 
>> 
>> However Spark support is not present directly - we run everything with shell 
>> scripts and spark-submit. There is a plugin interface where one could create 
>> a Spark plugin, but I found it very cumbersome when I did investigate and 
>> didn't have the time to work through it to develop that.
>> 
>> It has some quirks and while there is actually a REST API for adding jos and 
>> dynamically scheduling jobs, it is not documented anywhere so you kinda have 
>> to figure it out for yourself. But in terms of ease of use I found it way 
>> better than Oozie. I haven't tried Chronos, and it seemed quite involved to 
>> set up. Haven't tried Luigi either.
>> 
>> Spark job server is good but as you say lacks some stuff like scheduling and 
>> DAG type workflows (independent of spark-defined job flows).
>> 
>> 
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <jornfra...@gmail.com 
>> <mailto:jornfra...@gmail.com>> wrote:
>> Check also falcon in combination with oozie
>> 
>> Le ven. 7 août 2015 à 17:51, Hien Luu <h...@linkedin.com.invalid 
>> <mailto:h...@linkedin.com.invalid>> a écrit :
>> Looks like Oozie can satisfy most of your requirements. 
>> 
>> 
>> 
>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vikramk...@gmail.com 
>> <mailto:vikramk...@gmail.com>> wrote:
>> Hi,
>> I'm looking for open source workflow tools/engines that allow us to schedule 
>> spark jobs on a datastax cassandra cluster. Since there are tonnes of 
>> alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I wanted to 
>> check with people here to see what they are using today.
>> 
>> Some of the requirements of the workflow engine that I'm looking for are
>> 
>> 1. First class support for submitting Spark jobs on Cassandra. Not some 
>> wrapper Java code to submit tasks.
>> 2. Active open source community support and well tested at production scale.
>> 3. Should be dead easy to write job dependencies using XML or a web interface.
>> E.g., job A depends on Job B and Job C, so run Job A after B and C are 
>> finished. Don't need to write full blown java applications to specify job 
>> parameters and dependencies. Should be very simple to use.
>> 4. Time based  recurrent scheduling. Run the spark jobs at a given time 
>> every hour or day or week or month.
>> 5. Job monitoring, alerting on failures and email notifications on daily 
>> basis.
>> 
>> I have looked at Ooyala's spark job server, which seems to be geared towards 
>> making spark jobs run faster by sharing contexts between the jobs, but isn't 
>> a full blown workflow engine per se. A combination of spark job server and 
>> workflow engine would be ideal 
>> 
>> Thanks for the inputs
>> 
>> 
>> 
> 
> 



Re: How to passing parameters to another java class

2015-11-15 Thread Fengdong Yu
Can you try : new PixelGenerator(startTime, endTime) ?



> On Nov 16, 2015, at 12:47 PM, Zhang, Jingyu  wrote:
> 
> I want to pass two parameters into a new Java class used from rdd.mapPartitions();
> the code is like the following.
> ---Source Code
> 
> Main method:
> 
> /*the parameters that I want to pass into the PixelGenerator.class for 
> selecting any items between the startTime and the endTime.
> 
> */
> 
> int startTime, endTime;   
> 
> JavaRDD<PixelObject> pixelsObj = pixelsStr.mapPartitions(new
> PixelGenerator());
> 
> 
> PixelGenerator.java
> 
> public class PixelGenerator implements FlatMapFunction<Iterator, PixelObject> {
> 
> 
> public Iterable<PixelObject> call(Iterator lines) {
> 
> ...
> 
> }
> 
> Can anyone tell me how to pass startTime and endTime into the PixelGenerator
> class?
> 
> Many Thanks
> 
> 



Re: How to passing parameters to another java class

2015-11-15 Thread Fengdong Yu
Just make PixelGenerator a nested static class?






> On Nov 16, 2015, at 1:22 PM, Zhang, Jingyu  wrote:
> 
> Fengdong



Re: How to passing parameters to another java class

2015-11-15 Thread Fengdong Yu
If you get a "cannot serialize" (Task not serializable) exception, then you need to make
PixelGenerator a static class.




> On Nov 16, 2015, at 1:10 PM, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote:
> 
> Thanks, that worked in the local environment but not on the Spark cluster.
> 
> 
> On 16 November 2015 at 16:05, Fengdong Yu <fengdo...@everstring.com 
> <mailto:fengdo...@everstring.com>> wrote:
> Can you try : new PixelGenerator(startTime, endTime) ?
> 
> 
> 
>> On Nov 16, 2015, at 12:47 PM, Zhang, Jingyu <jingyu.zh...@news.com.au 
>> <mailto:jingyu.zh...@news.com.au>> wrote:
>> 
>> I want to pass two parameters into new java class from rdd.mapPartitions(), 
>> the code  like following.
>> ---Source Code
>> 
>> Main method:
>> 
>> /*the parameters that I want to pass into the PixelGenerator.class for 
>> selecting any items between the startTime and the endTime.
>> 
>> */
>> 
>> int startTime, endTime;   
>> 
>> JavaRDD pixelsObj = pixelsStr.mapPartitions(new 
>> PixelGenerator());
>> 
>> 
>> PixelGenerator.java
>> 
>> public class PixelGenerator implements FlatMapFunction<Iterator, 
>> PixelObject> {
>> 
>> 
>> public Iterable call(Iterator lines) {
>> 
>> ...
>> 
>> }
>> 
>> Can anyone told me how to pass the startTime, endTime into PixelGenerator 
>> class?
>> 
>> Many Thanks
>> 
>> 
> 
> 
> 



Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
The code looks good. Can you check the imports in your code? The stack trace points at
com.honeywell.test.testhive.HiveSpark.





> On Nov 16, 2015, at 3:02 PM, Yogesh Vyas  wrote:
> 
> Hi,
> 
> While I am trying to read a json file using SQLContext, i get the
> following error:
> 
> Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache.spark.sql.SQLContext.<init>(Lorg/apache/spark/api/java/JavaSparkContext;)V
>at com.honeywell.test.testhive.HiveSpark.main(HiveSpark.java:15)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 
> 
> I am using pom.xml with following dependencies and versions:
> spark-core_2.11 with version 1.5.1
> spark-streaming_2.11 with version 1.5.1
> spark-sql_2.11 with version 1.5.1
> 
> Can anyone please help me out in resolving this ?
> 
> Regards,
> Yogesh
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
Ignore my previous reply; I see that HiveSpark.java is where your main method is located.

can you paste the whole pom.xml and your code?




> On Nov 16, 2015, at 3:39 PM, Fengdong Yu <fengdo...@everstring.com> wrote:
> 
> The code looks good. can you check your ‘import’ in your code?  because it 
> calls ‘honeywell.test’?
> 
> 
> 
> 
> 
>> On Nov 16, 2015, at 3:02 PM, Yogesh Vyas <informy...@gmail.com> wrote:
>> 
>> Hi,
>> 
>> While I am trying to read a json file using SQLContext, i get the
>> following error:
>> 
>> Exception in thread "main" java.lang.NoSuchMethodError:
>> org.apache.spark.sql.SQLContext.<init>(Lorg/apache/spark/api/java/JavaSparkContext;)V
>>   at com.honeywell.test.testhive.HiveSpark.main(HiveSpark.java:15)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>   at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> 
>> 
>> I am using pom.xml with following dependencies and versions:
>> spark-core_2.11 with version 1.5.1
>> spark-streaming_2.11 with version 1.5.1
>> spark-sql_2.11 with version 1.5.1
>> 
>> Can anyone please help me out in resolving this ?
>> 
>> Regards,
>> Yogesh
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
Also, make sure your Scala version is 2.11 in your build.



> On Nov 16, 2015, at 3:43 PM, Fengdong Yu <fengdo...@everstring.com> wrote:
> 
> Ignore my inputs, I think HiveSpark.java is your main method located.
> 
> can you paste the whole pom.xml and your code?
> 
> 
> 
> 
>> On Nov 16, 2015, at 3:39 PM, Fengdong Yu <fengdo...@everstring.com> wrote:
>> 
>> The code looks good. can you check your ‘import’ in your code?  because it 
>> calls ‘honeywell.test’?
>> 
>> 
>> 
>> 
>> 
>>> On Nov 16, 2015, at 3:02 PM, Yogesh Vyas <informy...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> While I am trying to read a json file using SQLContext, i get the
>>> following error:
>>> 
>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>> org.apache.spark.sql.SQLContext.<init>(Lorg/apache/spark/api/java/JavaSparkContext;)V
>>>  at com.honeywell.test.testhive.HiveSpark.main(HiveSpark.java:15)
>>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>  at 
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>  at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>  at java.lang.reflect.Method.invoke(Method.java:597)
>>>  at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>>>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>> 
>>> 
>>> I am using pom.xml with following dependencies and versions:
>>> spark-core_2.11 with version 1.5.1
>>> spark-streaming_2.11 with version 1.5.1
>>> spark-sql_2.11 with version 1.5.1
>>> 
>>> Can anyone please help me out in resolving this ?
>>> 
>>> Regards,
>>> Yogesh
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>> 
>> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
what’s your SQL?




> On Nov 16, 2015, at 3:02 PM, Yogesh Vyas  wrote:
> 
> Hi,
> 
> While I am trying to read a json file using SQLContext, i get the
> following error:
> 
> Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache.spark.sql.SQLContext.<init>(Lorg/apache/spark/api/java/JavaSparkContext;)V
>at com.honeywell.test.testhive.HiveSpark.main(HiveSpark.java:15)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 
> 
> I am using pom.xml with following dependencies and versions:
> spark-core_2.11 with version 1.5.1
> spark-streaming_2.11 with version 1.5.1
> spark-sql_2.11 with version 1.5.1
> 
> Can anyone please help me out in resolving this ?
> 
> Regards,
> Yogesh
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Fengdong Yu
This is the simplest announcement I have ever seen.



> On Nov 11, 2015, at 12:49 AM, Reynold Xin  wrote:
> 
> Hi All,
> 
> Spark 1.5.2 is a maintenance release containing stability fixes. This release 
> is based on the branch-1.5 maintenance branch of Spark. We *strongly 
> recommend* all 1.5.x users to upgrade to this release.
> 
> The full list of bug fixes is here: http://s.apache.org/spark-1.5.2 
> 
> 
> http://spark.apache.org/releases/spark-release-1-5-2.html 
> 
> 
> 



Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Fengdong Yu
Yes, that's the problem.
http://search.maven.org/#artifactdetails%7Ccom.twitter%7Cparquet-avro%7C1.6.0%7Cjar

parquet-avro 1.6.0 depends on parquet-hadoop 1.6.0, which triggers this bug.

Can you change the version to 1.6.0rc7 manually?




> On Nov 9, 2015, at 9:34 PM, swetha kasireddy <swethakasire...@gmail.com> 
> wrote:
> 
> I am using the following:
> 
> 
> <dependency>
>   <groupId>com.twitter</groupId>
>   <artifactId>parquet-avro</artifactId>
>   <version>1.6.0</version>
> </dependency>
> 
> On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu <fengdo...@everstring.com 
> <mailto:fengdo...@everstring.com>> wrote:
> Which Spark version are you using?
> 
> It was fixed in Parquet 1.7.x, so Spark 1.5.x should work.
> 
> 
> 
> 
> > On Nov 9, 2015, at 3:43 PM, swetha <swethakasire...@gmail.com 
> > <mailto:swethakasire...@gmail.com>> wrote:
> >
> > Hi,
> >
> > I see unwanted Warning when I try to save a Parquet file in hdfs in Spark.
> > Please find below the code and the Warning message. Any idea as to how to
> > avoid the unwanted Warning message?
> >
> > activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
> > classOf[ActiveSession],
> >  classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)
> >
> > Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter:
> > could not write summary file for active_sessions_current
> > parquet.io.ParquetEncodingException:
> > maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid:
> > all the files must be contained in the root active_sessions_current
> >   at
> > parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
> >   at
> > parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
> >   at
> > parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
> >   at
> > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
> >   at
> > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
> >
> >
> >
> > --
> > View this message in context: 
> > http://apache-spark-user-list.1001560.n3.nabble.com/parquet-io-ParquetEncodingException-Warning-when-trying-to-save-parquet-file-in-Spark-tp25326.html
> >  
> > <http://apache-spark-user-list.1001560.n3.nabble.com/parquet-io-ParquetEncodingException-Warning-when-trying-to-save-parquet-file-in-Spark-tp25326.html>
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > <mailto:user-unsubscr...@spark.apache.org>
> > For additional commands, e-mail: user-h...@spark.apache.org 
> > <mailto:user-h...@spark.apache.org>
> >
> 
> 



Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Fengdong Yu
Which Spark version are you using?

It was fixed in Parquet 1.7.x, so Spark 1.5.x should work.




> On Nov 9, 2015, at 3:43 PM, swetha  wrote:
> 
> Hi,
> 
> I see unwanted Warning when I try to save a Parquet file in hdfs in Spark.
> Please find below the code and the Warning message. Any idea as to how to
> avoid the unwanted Warning message?
> 
> activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
> classOf[ActiveSession],
>  classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)
> 
> Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter:
> could not write summary file for active_sessions_current
> parquet.io.ParquetEncodingException:
> maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid:
> all the files must be contained in the root active_sessions_current
>   at
> parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
>   at
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
>   at
> parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
>   at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
>   at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/parquet-io-ParquetEncodingException-Warning-when-trying-to-save-parquet-file-in-Spark-tp25326.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: There is any way to write from spark to HBase CDH4?

2015-10-27 Thread Fengdong Yu
Has this been released with Spark 1.x, or is it still only in trunk?




> On Oct 27, 2015, at 6:22 PM, Adrian Tanase  wrote:
> 
> Also I just remembered about cloudera’s contribution
> http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
>  
> 
> 
> From: Deng Ching-Mallete
> Date: Tuesday, October 27, 2015 at 12:03 PM
> To: avivb
> Cc: user
> Subject: Re: There is any way to write from spark to HBase CDH4?
> 
> Hi,
> 
> We are using phoenix-spark (http://phoenix.apache.org/phoenix_spark.html 
> ) to write data to HBase, but 
> it requires spark 1.3.1+ and phoenix 4.4+. Previously, when we were still on 
> spark 1.2, we used the HBase API to write directly to HBase.
> 
> For HBase 0.98, it's something like this:
> 
> rdd.foreachPartition(partition => {
>// create hbase config
>val hConf = HBaseConfiguration.create()
>val hTable = new HTable(hConf, "TABLE_1")
>hTable.setAutoFlush(false)
> 
>partition.foreach(r => {
>  // generate row key
>  // create row
>  val hRow = new Put(rowKey)
> 
>  // add columns 
>  hRow.add(..)
> 
>  hTable.put(hRow)
>})
>hTable.flushCommits()
>hTable.close()
> })
> 
> HTH,
> Deng
> 
> On Tue, Oct 27, 2015 at 5:36 PM, avivb  > wrote:
>> I have already tried it with https://github.com/unicredit/hbase-rdd and
>> https://github.com/nerdammer/spark-hbase-connector, and in both cases I get
>> timeouts.
>> 
>> So I would like to know about other option to write from Spark to HBase
>> CDH4.
>> 
>> Thanks!
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/There-is-any-way-to-write-from-spark-to-HBase-CDH4-tp25209.html
>>  
>> 
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> 
>> For additional commands, e-mail: user-h...@spark.apache.org
>>  



Re: spark to hbase

2015-10-27 Thread Fengdong Yu
Also, please move the HBase-related objects out of the Scala object and create them inside
the closure; this will resolve the serialization issue and avoid opening a connection
repeatedly.

And remember to close the table after the final flush.
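
Putting the foreachPartition suggestion below and the points above together, a minimal sketch (the quorum, table name, column family/qualifier and the TrainFeature fields are taken from this thread; resultRdd stands for the aggregateByKey(...).values pipeline):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

resultRdd.foreachPartition { partition =>
  // create the connection inside the partition, so nothing HBase-related is serialized
  val conf = HBaseConfiguration.create()
  conf.set("hbase.zookeeper.quorum", "192.168.1.66")
  conf.set("hbase.zookeeper.property.clientPort", "2181")
  val table = new HTable(conf, "ljh_test3")
  table.setAutoFlush(false)

  partition.foreach { res =>
    val put = new Put(Bytes.toBytes(res.toKey()))
    put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes(res.positiveCount))
    table.put(put)
  }

  table.flushCommits()   // one flush per partition instead of per record
  table.close()
}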



> On Oct 28, 2015, at 10:13 AM, Ted Yu  wrote:
> 
> For #2, have you checked task log(s) to see if there was some clue ?
> 
> You may want to use foreachPartition to reduce the number of flushes.
> 
> In the future, please remove color coding - it is not easy to read.
> 
> Cheers
> 
> On Tue, Oct 27, 2015 at 6:53 PM, jinhong lu wrote:
> Hi, Ted
> 
> thanks for your help.
> 
> I checked the jar, it is on the classpath, and now the problem is:
> 
> 1. The following code runs fine and puts the result into HBase:
> 
>   val res = 
> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>  TrainFeature())(seqOp, combOp).values.first()
>  val configuration = HBaseConfiguration.create();
>   configuration.set("hbase.zookeeper.property.clientPort", "2181");
>   configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
>   configuration.set("hbase.master", "192.168.1.66:6 
> ");
>   val table = new HTable(configuration, "ljh_test3");
>   var put = new Put(Bytes.toBytes(res.toKey()));
>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
> Bytes.toBytes(res.positiveCount));
>   table.put(put);
>   table.flushCommits()
> 
> 2. But if I change the first() call to foreach:
> 
>   
> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>  TrainFeature())(seqOp, combOp).values.foreach({res=>
>   val configuration = HBaseConfiguration.create();
>   configuration.set("hbase.zookeeper.property.clientPort", "2181");
>   configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
>   configuration.set("hbase.master", "192.168.1.66:6 
> ");
>   val table = new HTable(configuration, "ljh_test3");
>   var put = new Put(Bytes.toBytes(res.toKey()));
>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
> Bytes.toBytes(res.positiveCount));
>   table.put(put);
> 
> })
> 
> the application hangs, and the last log lines are:
> 
> 15/10/28 09:30:33 INFO DAGScheduler: Missing parents for ResultStage 2: List()
> 15/10/28 09:30:33 INFO DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[6] at values at TrainModel3.scala:98), which is now runnable
> 15/10/28 09:30:33 INFO MemoryStore: ensureFreeSpace(7032) called with 
> curMem=264045, maxMem=278302556
> 15/10/28 09:30:33 INFO MemoryStore: Block broadcast_3 stored as values in 
> memory (estimated size 6.9 KB, free 265.2 MB)
> 15/10/28 09:30:33 INFO MemoryStore: ensureFreeSpace(3469) called with 
> curMem=271077, maxMem=278302556
> 15/10/28 09:30:33 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes 
> in memory (estimated size 3.4 KB, free 265.1 MB)
> 15/10/28 09:30:33 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
> on 10.120.69.53:43019 (size: 3.4 KB, free: 265.4 MB)
> 15/10/28 09:30:33 INFO SparkContext: Created broadcast 3 from broadcast at 
> DAGScheduler.scala:874
> 15/10/28 09:30:33 INFO DAGScheduler: Submitting 1 missing tasks from 
> ResultStage 2 (MapPartitionsRDD[6] at values at TrainModel3.scala:98)
> 15/10/28 09:30:33 INFO YarnScheduler: Adding task set 2.0 with 1 tasks
> 15/10/28 09:30:33 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, 
> gdc-dn147-formal.i.nease.net , 
> PROCESS_LOCAL, 1716 bytes)
> 15/10/28 09:30:34 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
> on gdc-dn147-formal.i.nease.net:59814 (size: 3.4 KB, free: 1060.3 MB)
> 15/10/28 09:30:34 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
> output locations for shuffle 0 to gdc-dn147-formal.i.nease.net:52904
> 15/10/28 09:30:34 INFO MapOutputTrackerMaster: Size of output statuses for 
> shuffle 0 is 154 bytes
> 
> 3. Besides, if I take the configuration and HTable out of the foreach:
> 
> val configuration = HBaseConfiguration.create();
> configuration.set("hbase.zookeeper.property.clientPort", "2181");
> configuration.set("hbase.zookeeper.quorum", "192.168.1.66");
> configuration.set("hbase.master", "192.168.1.66:6");
> val table = new HTable(configuration, "ljh_test3");
> 
> lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new
>  TrainFeature())(seqOp, combOp).values.foreach({ res =>
> 
>   var put = new Put(Bytes.toBytes(res.toKey()));
>   put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), 
> Bytes.toBytes(res.positiveCount));
>   table.put(put);
> 
> })
> table.flushCommits()
> 
> I get a serialization problem:
> 
> Exception in thread "main" org.apache.spark.SparkException: Task not 
> serializable
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
> at 
> 

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Fengdong Yu
How many partitions did you generate?
If millions were generated, then a huge amount of memory is consumed.





> On Oct 26, 2015, at 10:58 AM, Jerry Lam  wrote:
> 
> Hi guys,
> 
> I mentioned that the partitions are generated so I tried to read the 
> partition data from it. The driver is OOM after few minutes. The stack trace 
> is below. It looks very similar to the the jstack above (note on the refresh 
> method). Thanks!
> 
> Name: java.lang.OutOfMemoryError
> Message: GC overhead limit exceeded
> StackTrace: java.util.Arrays.copyOf(Arrays.java:2367)
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
> java.lang.StringBuilder.append(StringBuilder.java:132)
> org.apache.hadoop.fs.Path.toString(Path.java:384)
> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(interfaces.scala:447)
> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(interfaces.scala:447)
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(interfaces.scala:447)
> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh(interfaces.scala:453)
> org.apache.spark.sql.sources.HadoopFsRelation.org 
> $apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute(interfaces.scala:465)
> org.apache.spark.sql.sources.HadoopFsRelation.org 
> $apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache(interfaces.scala:463)
> org.apache.spark.sql.sources.HadoopFsRelation.cachedLeafStatuses(interfaces.scala:470)
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:381)
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org 
> $apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache$lzycompute(ParquetRelation.scala:145)
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org 
> $apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache(ParquetRelation.scala:143)
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:196)
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:196)
> scala.Option.getOrElse(Option.scala:120)
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.dataSchema(ParquetRelation.scala:196)
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:31)
> org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:395)
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:267)
> 
> On Sun, Oct 25, 2015 at 10:25 PM, Jerry Lam  > wrote:
> Hi Josh,
> 
> No, I don't have speculation enabled. The driver ran for a few hours until it 
> was OOM. Interestingly, all partitions are generated successfully (_SUCCESS 
> file is written in the output directory). Is there a reason why the driver 
> needs so much memory? The jstack revealed that it called refresh some file 
> statuses. Is there a way to prevent OutputCommitCoordinator from using so much 
> memory? 
> 
> Ultimately, I choose to use partitions because most of the queries I have 
> will execute based the partition field. For example, "SELECT events from 
> customer where customer_id = 1234". If the partition is based on customer_id, 
> all events for a customer can be easily retrieved without filtering the 
> entire dataset which is much more efficient (I hope). However, I notice that 
> the implementation of the partition logic does not seem to allow this type of 
> use cases without using a lot of memory which is a bit odd in my opinion. Any 
> help will be greatly appreciated.
> 
> Best Regards,
> 
> Jerry
> 
> 
> 
> On Sun, Oct 25, 2015 at 9:25 PM, Josh Rosen  > 
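One more note on the customer_id query pattern described above: as a workaround, you can point the reader at a single partition directory instead of the table root, so the driver never has to list every partition. Again only a sketch, with an illustrative path:

// Reading one partition directory directly avoids listing the whole table;
// note that the partition column (customer_id) will not show up in the schema
// when the data is read this way.
val events = sqlContext.read.parquet("/path/to/events/customer_id=1234")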

Re: Machine learning with spark (book code example error)

2015-10-14 Thread Fengdong Yu
I don’t recommend this code style; you’d better brace the function block:

val testLabels = testRDD.map { case (file, text) => {
  val topic = file.split("/").takeRight(2).head
  newsgroupsMap(topic)
} }
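Another option when entering multi-line blocks like this in the Spark shell is the REPL’s :paste mode, which compiles the whole block only after you finish it, so the closure cannot be cut off early. For example:

scala> :paste
// Entering paste mode (ctrl-D to finish)

val testLabels = testRDD.map { case (file, text) =>
  val topic = file.split("/").takeRight(2).head
  newsgroupsMap(topic)
}

// Exiting paste mode, now interpreting.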


> On Oct 14, 2015, at 15:46, Nick Pentreath  wrote:
> 
> Hi there. I'm the author of the book (thanks for buying it by the way :)
> 
> Ideally if you're having any trouble with the book or code, it's best to 
> contact the publisher and submit a query 
> (https://www.packtpub.com/books/content/support/17400 
> ) 
> 
> However, I can help with this issue. The problem is that the "testLabels" 
> code needs to be indented over multiple lines:
> 
> val testPath = "/PATH/20news-bydate-test/*"
> val testRDD = sc.wholeTextFiles(testPath)
> val testLabels = testRDD.map { case (file, text) => 
>   val topic = file.split("/").takeRight(2).head
>   newsgroupsMap(topic)
> }
> 
> As it is in the sample code attached. If you copy the whole indented block 
> (or line by line) into the console, it should work - I've tested all the 
> sample code again and indeed it works for me.
> 
> Hope this helps
> Nick
> 
> On Tue, Oct 13, 2015 at 8:31 PM, Zsombor Egyed  > wrote:
> Hi!
> 
> I was reading the ML with Spark book, and I was very interested in the 9th 
> chapter (text mining), so I tried the code examples. 
> 
> Everything was fine, but in this line:
> val testLabels = testRDD.map { 
> case (file, text) => val topic = file.split("/").takeRight(2).head
> newsgroupsMap(topic) }
> I got an error: "value newsgroupsMap is not a member of String"
> 
> Other relevant part of the code:
> val path = "/PATH/20news-bydate-train/*"
> val rdd = sc.wholeTextFiles(path) 
> val newsgroups = rdd.map { case (file, text) => 
> file.split("/").takeRight(2).head }
> 
> val tf = hashingTF.transform(tokens)
> val idf = new IDF().fit(tf)
> val tfidf = idf.transform(tf)
> 
> val newsgroupsMap = newsgroups.distinct.collect().zipWithIndex.toMap
> val zipped = newsgroups.zip(tfidf)
> val train = zipped.map { case (topic, vector) 
> =>LabeledPoint(newsgroupsMap(topic), vector) }
> train.cache
> 
> val model = NaiveBayes.train(train, lambda = 0.1)
> 
> val testPath = "/PATH//20news-bydate-test/*"
> val testRDD = sc.wholeTextFiles(testPath)
> val testLabels = testRDD.map { case (file, text) => val topic = 
> file.split("/").takeRight(2).head newsgroupsMap(topic) }
> 
> I attached the whole program code. 
> Can anyone help, what the problem is?
> 
> Regards,
> Zsombor
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 



Re: how to use SharedSparkContext

2015-10-14 Thread Fengdong Yu
Oh, yes. Thanks much.
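For the archives: SharedSparkContext comes from the spark-testing-base project (the com.holdenkarau.spark.testing package Akhil mentions below), not from spark-core, so build.sbt needs an entry roughly like the one below. The version string is only a placeholder; check the project’s documentation for the release matching your Spark version.

// Placeholder version: use the spark-testing-base release built for your Spark version.
libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "<your-spark-version>_<package-version>" % "test"

After that, import com.holdenkarau.spark.testing.SharedSparkContext and mix the trait into the test suite.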



> On Oct 14, 2015, at 18:47, Akhil Das  wrote:
> 
> com.holdenkarau.spark.testing



Re: spark sql OOM

2015-10-14 Thread Fengdong Yu
Can you search the mail archive before asking the question? At least search for 
how to ask a question.

Nobody can give you an answer if you don’t paste your SQL or Spark SQL code.


> On Oct 14, 2015, at 17:40, Andy Zhao  wrote:
> 
> Hi guys, 
> 
> I'm testing sparkSql 1.5.1, and I use hadoop-2.5.0-cdh5.3.2. 
> One SQL query which ran successfully using Hive failed when I ran it using
> Spark SQL. 
> I got the following error (the output has not been preserved in this archive): 
> 
> I read the source code. It seems that the compute method of HadoopRDD is
> called an infinite number of times; every time it gets called, new instances
> need to be allocated on the heap, and finally it OOMs. 
> 
> Does anyone have the same problem? 
> 
> Thanks
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-OOM-tp25060.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



how to use SharedSparkContext

2015-10-12 Thread Fengdong Yu
Hi, 
How do I add the dependency in build.sbt if I want to use SharedSparkContext?

I’ve added spark-core, but it doesn’t work (cannot find SharedSparkContext).



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org