Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Fengdong Yu
val req_logs_with_dpid = req_logs.filter(req_logs("req_info.pid") != "" ) Azuryy Yu Sr. Infrastructure Engineer cell: 158-0164-9103 WeChat: azuryy On Wed, Dec 9, 2015 at 7:43 PM, Prashant Bhardwaj < prashant2006s...@gmail.com> wrote: > Hi > > I have two columns in my json which can have null,
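A minimal sketch of that filter, assuming a DataFrame req_logs with a nested req_info.pid column. Note that in Spark 1.x the column-level inequality operator is !== ; Scala's plain != compares the Column object to the string and yields a Boolean, which filter will not accept:

    val req_logs_with_pid = req_logs.filter(
      req_logs("req_info.pid").isNotNull && (req_logs("req_info.pid") !== "")
    )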

Re: getting error while persisting in hive

2015-12-09 Thread Fengdong Yu
.write not .write() > On Dec 9, 2015, at 5:37 PM, Divya Gehlot wrote: > > Hi, > I am using spark 1.4.1 . > I am getting error when persisting spark dataframe output to hive > scala> > df.select("name","age").write().format("com.databricks.spark.csv").mode(SaveMode.Append).saveAsTable("Per
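The corrected call chain might look like the sketch below (the table name is hypothetical since the original was cut off, and the com.databricks.spark.csv format assumes the spark-csv package is on the classpath):

    import org.apache.spark.sql.SaveMode

    df.select("name", "age")
      .write
      .format("com.databricks.spark.csv")
      .mode(SaveMode.Append)
      .saveAsTable("person_table")   // hypothetical name; the original was truncated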

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Fengdong Yu
I don’t think there is performance difference between 1.x API and 2.x API. but it’s not a big issue for your change, only com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java

Re: About Spark On Hbase

2015-12-08 Thread Fengdong Yu
https://github.com/nerdammer/spark-hbase-connector This is better and easy to use. > On Dec 9, 2015, at 3:04 PM, censj wrote: > > hi all, > now I using spark,but I not found spark operation hbase open source. Do > any one tell me? >
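For reference, a write sketch adapted from that project's README as I recall it; the exact method names and the it.nerdammer package path should be verified against the connector's own documentation, and the table, column, and column-family names here are illustrative:

    import it.nerdammer.spark.hbase._

    val rdd = sc.parallelize(1 to 100).map(i => (i.toString, i + 1, "Hello"))

    rdd.toHBaseTable("mytable")
       .toColumns("column1", "column2")
       .inColumnFamily("mycf")
       .save()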

Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
/in/ramkumarcs31> > > > On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu <mailto:fengdo...@everstring.com>> wrote: > Can you detail your question? what looks like your previous batch and the > current batch? > > > > > >> On Dec 8, 2015, at 3:5

Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
Can you detail your question? What do your previous batch and the current batch look like? > On Dec 8, 2015, at 3:52 PM, Ramkumar V wrote: > > Hi, > > I'm running Java over Spark in cluster mode. I want to apply a filter on > javaRDD based on some previous batch values. if i store those valu

Re: NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-07 Thread Fengdong Yu
Can you try like this in your sbt: val spark_version = "1.5.2" val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api") val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty") libraryDependencies ++= Seq( "org.apache.spark" %% "spark
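The snippet is cut off; a completed build.sbt sketch of the same idea might look like this (the module list and the "provided" scope are assumptions, not from the original message):

    val spark_version = "1.5.2"
    val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api")
    val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % spark_version % "provided" excludeAll(excludeServletApi, excludeEclipseJetty),
      "org.apache.spark" %% "spark-sql"  % spark_version % "provided"
    )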

Re: python rdd.partionBy(): any examples of a custom partitioner?

2015-12-07 Thread Fengdong Yu
refer here: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html of section: Example 4-27. Python custom partitioner > On Dec 8, 2015, at 10:07 AM, Keith Freeman <8fo...@gmail.com> wrote: > > I'm not a python expert, so I'm wondering if anybody has a working

Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Fengdong Yu
I suppose your output data is ORC, and you want to save it to the Hive database test with external table name testTable: import scala.collection.immutable sqlContext.createExternalTable("test.testTable", "org.apache.spark.sql.hive.orc", Map("path" -> "/data/test/mydata")) > On Dec 7, 2015, at 5:28
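A slightly fuller sketch of the same idea, assuming an existing DataFrame df, a HiveContext bound to sqlContext, and the path and table names quoted above:

    // write the data out as ORC, then register an external Hive table over that path
    df.write.format("orc").save("/data/test/mydata")
    sqlContext.createExternalTable(
      "test.testTable",
      "org.apache.spark.sql.hive.orc",
      Map("path" -> "/data/test/mydata")
    )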

Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Fengdong Yu
If your RDD is JSON format, that’s easy. val df = sqlContext.read.json(rdd) df.saveAsTable("your_table_name") > On Dec 7, 2015, at 5:28 PM, Divya Gehlot wrote: > > Hi, > I am new bee to Spark. > Could somebody guide me how can I persist my spark RDD results in Hive using > SaveAsTable API.
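As a sketch, assuming rdd is an RDD[String] of JSON documents and sqlContext is a HiveContext (on Spark 1.4+ the .write form is the non-deprecated variant of saveAsTable):

    import org.apache.spark.sql.SaveMode

    val df = sqlContext.read.json(rdd)
    df.write.mode(SaveMode.Append).saveAsTable("your_table_name")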

Re: Intersection of two sets by key - join vs filter + join

2015-12-06 Thread Fengdong Yu
Don’t do the join first. Broadcast your small RDD: val bc = sc.broadcast(small_rdd) then large_rdd.filter(x.key in bc.value).map( x => { bc.value.other_fields + x }).distinct.groupByKey > On Dec 7, 2015, at 1:41 PM, Z Z wrote: > > I have two RDDs, one really large in size and other
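A runnable sketch of that broadcast-based approach (a map-side semi-join), assuming small_rdd: RDD[(K, V)] is small enough to collect and large_rdd: RDD[(K, W)]:

    // collect the small side as a map and broadcast it to every executor
    val bc = sc.broadcast(small_rdd.collectAsMap())

    // keep only large-side records whose key exists on the small side,
    // attaching the small-side value instead of performing a shuffle join
    val joined = large_rdd
      .filter { case (k, _) => bc.value.contains(k) }
      .map    { case (k, w) => (k, (bc.value(k), w)) }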

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
Yes, it results in a shuffle. > On Dec 4, 2015, at 6:04 PM, Stephen Boesch wrote: > > @Yu Fengdong: Your approach - specifically the groupBy - results in a shuffle, > does it not? > > 2015-12-04 2:02 GMT-08:00 Fengdong Yu <mailto:fengdo...@everstring.com>>: > T

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
There are many ways; one simple one, if you want to know how many rows there are for each month: sqlContext.read.parquet(".../month=*").select($"month").groupBy($"month").count
The output looks like:
    month   count
    201411    100
    201412    200
Hope it helps. > On Dec 4, 2015, at 5:53 PM, Yiannis
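A cleaned-up sketch, assuming the data is laid out in month=... partition directories under a base path (the base path here is hypothetical); reading the base path lets Spark discover the month partition column:

    import sqlContext.implicits._

    val counts = sqlContext.read.parquet("/data/events")   // hypothetical base path
      .groupBy($"month")
      .count()
    counts.show()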

Re: sparkSQL Load multiple tables

2015-12-02 Thread Fengdong Yu
It cannot read multiple tables, but if your tables have the same columns, you can read them one by one, then unionAll them, such as: val df1 = sqlContext.table("table1") val df2 = sqlContext.table("table2") val df = df1.unionAll(df2) > On Dec 2, 2015, at 4:06 PM, censj wrote: > > Dear a
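The same idea generalized to any number of same-schema tables (a sketch; the table names are hypothetical):

    val tables = Seq("table1", "table2", "table3")
    val df = tables.map(sqlContext.table).reduce(_ unionAll _)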

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Fengdong Yu
Hi, you can try this: if your table is under location "/test/table/" on HDFS and has partitions: "/test/table/dt=2012" "/test/table/dt=2013" df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table") > On Dec 2, 2015, at 10:50 AM, Isabelle Phan wrote: > > df.write.partitionBy("date").i
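If the goal is a truly static partition (the subject of the thread), one alternative sketch is to go through HiveQL. This assumes a HiveContext, an existing partitioned table, and that the DataFrame's columns match the table's non-partition columns; the table name and partition value here are hypothetical:

    df.registerTempTable("staging")
    sqlContext.sql(
      "INSERT INTO TABLE target PARTITION (date='2015-12-01') SELECT * FROM staging")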

Re: Low Latency SQL query

2015-12-01 Thread Fengdong Yu
It depends on many things: 1) what’s your data format? CSV (text) or ORC/Parquet? 2) Do you have a data warehouse to summarize/cluster your data? If your data is text, or you query the raw data, it will be slow; Spark cannot do much to optimize your job. > On Dec 2, 2015, at 9:21 AM,

Re: load multiple directory using dataframe load

2015-11-23 Thread Fengdong Yu
hiveContext.read.format("orc").load("mypath/*") > On Nov 24, 2015, at 1:07 PM, Renu Yadav wrote: > > Hi , > > I am using dataframe and want to load orc files from multiple directories > like this: > hiveContext.read.format.load("mypath/3660,myPath/3661") > > but it is not working. > > Please
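If a glob does not fit (for example, only a few specific directories are wanted), a sketch that works on Spark 1.5-era APIs is to read each directory and union the results (the paths are taken from the question and may need adjusting):

    val paths = Seq("mypath/3660", "mypath/3661")
    val df = paths.map(p => hiveContext.read.format("orc").load(p)).reduce(_ unionAll _)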

Re: Dataframe constructor

2015-11-23 Thread Fengdong Yu
It's just as simple as: val df = sqlContext.sql("select * from table") or val df = sqlContext.read.json("hdfs_path") > On Nov 24, 2015, at 3:09 AM, spark_user_2015 wrote: > > Dear all, > > is the following usage of the Dataframe constructor correct or does it > trigger any side effects that I shou

How to adjust Spark shell table width

2015-11-21 Thread Fengdong Yu
Hi, I found that if the column value is too long, the Spark shell only shows a partial result. For example: sqlContext.sql("select url from tableA").show(10) cannot show the whole URL. So how do I adjust it? Thanks
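For reference, a sketch of the usual fix on Spark 1.5+: show takes a second truncate flag, and passing false prints the full column values:

    sqlContext.sql("select url from tableA").show(10, false)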

Re: has any spark write orc document

2015-11-19 Thread Fengdong Yu
You can use the DataFrame API: df.write.format("orc").save("...") > On Nov 20, 2015, at 2:59 PM, zhangjp <592426...@qq.com> wrote: > > Hi, > is there any spark write orc document like the parquet document? > http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files >
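A minimal round-trip sketch, assuming a DataFrame df, a HiveContext named hiveContext (ORC support lives in the Hive module in Spark 1.5), and a hypothetical output path:

    import org.apache.spark.sql.SaveMode

    df.write.format("orc").mode(SaveMode.Overwrite).save("/path/to/orc_output")
    val back = hiveContext.read.format("orc").load("/path/to/orc_output")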

Re: Spark job workflow engine recommendations

2015-11-18 Thread Fengdong Yu
Yes, you can submit jobs remotely. > On Nov 19, 2015, at 10:10 AM, Vikram Kone wrote: > > Hi Feng, > Does airflow allow remote submissions of spark jobs via spark-submit? > > On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu <mailto:fengdo...@everstring.com>> wrote:

Re: Spark job workflow engine recommendations

2015-11-18 Thread Fengdong Yu
Hi, we use 'Airflow' as our job workflow scheduler. > On Nov 19, 2015, at 9:47 AM, Vikram Kone wrote: > > Hi Nick, > Quick question about spark-submit command executed from azkaban with command > job type. > I see that when I press kill in azkaban portal on a spark-submit job, it > doesn'

Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Fengdong Yu
The simplest way is to remove all "provided" in your pom, then run 'sbt assembly' to build your final package, then get rid of '--jars' because the assembly already includes all dependencies. > On Nov 18, 2015, at 2:15 PM, Jack Yang wrote: > > So weird. Is there anything wrong with the way I made the

Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
And, also make sure your scala version is 2.11 for your build. > On Nov 16, 2015, at 3:43 PM, Fengdong Yu wrote: > > Ignore my inputs, I think HiveSpark.java is your main method located. > > can you paste the whole pom.xml and your code? > > > > >> On N

Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
Ignore my previous reply; I think HiveSpark.java is where your main method is located. Can you paste the whole pom.xml and your code? > On Nov 16, 2015, at 3:39 PM, Fengdong Yu wrote: > > The code looks good. can you check your ‘import’ in your code? because it > calls ‘ho

Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
The code looks good. Can you check the 'import' statements in your code? Because it calls 'honeywell.test'. > On Nov 16, 2015, at 3:02 PM, Yogesh Vyas wrote: > > Hi, > > While I am trying to read a json file using SQLContext, I get the > following error: > > Exception in thread "main" java.lang.No

Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
What’s your SQL? > On Nov 16, 2015, at 3:02 PM, Yogesh Vyas wrote: > > Hi, > > While I am trying to read a json file using SQLContext, I get the > following error: > > Exception in thread "main" java.lang.NoSuchMethodError: > org.apache.spark.sql.SQLContext.<init>(Lorg/apache/spark/api/java/JavaS

Re: How to passing parameters to another java class

2015-11-15 Thread Fengdong Yu
Just make PixelGenerator a nested static class? > On Nov 16, 2015, at 1:22 PM, Zhang, Jingyu wrote: > > Fengdong

Re: How to passing parameters to another java class

2015-11-15 Thread Fengdong Yu
If you got a “cannot be serialized” exception, then you need to make PixelGenerator a static class. > On Nov 16, 2015, at 1:10 PM, Zhang, Jingyu wrote: > > Thanks, that worked for the local environment but not in the Spark cluster. > > > On 16 November 2015 at 16:05, Fengdong

Re: How to passing parameters to another java class

2015-11-15 Thread Fengdong Yu
Can you try : new PixelGenerator(startTime, endTime) ? > On Nov 16, 2015, at 12:47 PM, Zhang, Jingyu wrote: > > I want to pass two parameters into new java class from rdd.mapPartitions(), > the code like following. > ---Source Code > > Main method: > > /*the parameters that I want to p

Re: [ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Fengdong Yu
This is the simplest announcement I have seen. > On Nov 11, 2015, at 12:49 AM, Reynold Xin wrote: > > Hi All, > > Spark 1.5.2 is a maintenance release containing stability fixes. This release > is based on the branch-1.5 maintenance branch of Spark. We *strongly > recommend* all 1.5.x users

Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Fengdong Yu
6.0rc7 manually ? > On Nov 9, 2015, at 9:34 PM, swetha kasireddy > wrote: > > I am using the following: > > > > com.twitter > parquet-avro > 1.6.0 > > > On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu <mailto:fengdo...@everstring.com>> w

Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Fengdong Yu
Which Spark version are you using? It was fixed in Parquet 1.7.x, so Spark 1.5.x will work. > On Nov 9, 2015, at 3:43 PM, swetha wrote: > > Hi, > > I see an unwanted warning when I try to save a Parquet file to HDFS in Spark. > Please find below the code and the warning message. Any idea as to how to

Re: spark to hbase

2015-10-27 Thread Fengdong Yu
Also, please move the HBase-related code into a Scala object; this will resolve the serialization issue and avoid opening connections repeatedly. And remember to close the table after the final flush. > On Oct 28, 2015, at 10:13 AM, Ted Yu wrote: > > For #2, have you checked task log(s) to see if there wa
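A sketch of that singleton-object pattern for writing from an RDD to HBase (HBase 1.x client API; the table, column family, and column names are hypothetical, and rdd is assumed to be an RDD[(String, String)] of row key/value pairs):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // one connection per executor JVM; nothing here is serialized with the tasks
    object HBaseSink {
      lazy val connection: Connection =
        ConnectionFactory.createConnection(HBaseConfiguration.create())
    }

    rdd.foreachPartition { rows =>
      val table = HBaseSink.connection.getTable(TableName.valueOf("my_table"))
      rows.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
      table.close()   // close the table once the partition is flushed
    }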

Re: There is any way to write from spark to HBase CDH4?

2015-10-27 Thread Fengdong Yu
Was this released with Spark 1.x, or is it still only in trunk? > On Oct 27, 2015, at 6:22 PM, Adrian Tanase wrote: > > Also I just remembered about cloudera’s contribution > http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/ > >

Re: Concurrent execution of actions within a driver

2015-10-26 Thread Fengdong Yu
Not in parallel. Spark only executes tasks when an action is called ('collect' here). rdd1.collect and rdd2.collect are executed sequentially, so Spark executes the two jobs one by one. > On Oct 26, 2015, at 7:26 PM, praveen S wrote: > > Does spark run different actions of an rdd within a driver in parallel als

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Fengdong Yu
How many partitions did you generate? If millions were generated, then a huge amount of memory is consumed. > On Oct 26, 2015, at 10:58 AM, Jerry Lam wrote: > > Hi guys, > > I mentioned that the partitions are generated so I tried to read the > partition data from it. The driver is OOM after a few minut

Re: how to use SharedSparkContext

2015-10-14 Thread Fengdong Yu
oh, Yes. Thanks much. > On Oct 14, 2015, at 18:47, Akhil Das wrote: > > com.holdenkarau.spark.testing
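For reference, a build.sbt sketch for pulling in SharedSparkContext from the spark-testing-base package mentioned above (the version string is illustrative and should be matched to your Spark version):

    libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "1.5.2_0.3.0" % "test"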

Re: spark sql OOM

2015-10-14 Thread Fengdong Yu
Can you search the mail archive before asking the question? At least search for how to ask a question. Nobody can give you an answer if you don’t paste your SQL or SparkSQL code. > On Oct 14, 2015, at 17:40, Andy Zhao wrote: > > Hi guys, > > I'm testing sparkSql 1.5.1, and I use hadoop-2.5.0-cd

Re: Machine learning with spark (book code example error)

2015-10-14 Thread Fengdong Yu
I don’t recommend this code style; you’d better brace the function block:
    val testLabels = testRDD.map { case (file, text) => {
        val topic = file.split("/").takeRight(2).head
        newsgroupsMap(topic)
      }
    }
> On Oct 14, 2015, at 15:46, Nick Pentreath wrote: > > Hi there. I'm the author of the book (t

how to use SharedSparkContext

2015-10-12 Thread Fengdong Yu
Hi, how do I add the dependency in build.sbt if I want to use SharedSparkContext? I’ve added spark-core, but it doesn’t work (cannot find SharedSparkContext).