Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Ted Yu
On Wed, Mar 9, 2016 at 9:24 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Hadoop glob pattern doesn't support multi level wildcard. >> >> Thanks >> >> On Mar 9, 2016, at 6:15 AM, Koert Kuipers <ko...@tresata.com> wrote: >> >> if its based on

Re: HBASE

2016-03-09 Thread Ted Yu
bq. it is kind of columnar NoSQL database. The storage format in HBase is not columnar. I would suggest you build upon what you already know (Spark and Hive) and expand on that. Also, if your work uses Big Data technologies, those would be the first to consider getting to know better. On Wed,

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Ted Yu
Hadoop glob pattern doesn't support multi level wildcard. Thanks > On Mar 9, 2016, at 6:15 AM, Koert Kuipers <ko...@tresata.com> wrote: > > if its based on HadoopFsRelation shouldn't it support it? HadoopFsRelation > handles globs > >> On Wed, Mar 9, 2016

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Ted Yu
This is currently not supported. > On Mar 9, 2016, at 4:38 AM, Jakub Liska wrote: > > Hey, > > is something like this possible? > > sqlContext.read.json("/mnt/views-p/base/2016/01/*/*-xyz.json") > > I switched to DataFrames because my source files changed from TSV to
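
For reference, a small sketch of the distinction being drawn, using the path from the thread (a single wildcard confined to one path segment was supported; wildcards spanning multiple path levels were the unsupported case):

    // one wildcard within a single path segment worked with the DataFrame readers
    val ok = sqlContext.read.json("/mnt/views-p/base/2016/01/*-xyz.json")
    // wildcards across multiple path levels were the case reported as unsupported
    // sqlContext.read.json("/mnt/views-p/base/2016/01/*/*-xyz.json")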

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-09 Thread Ted Yu
drop down menu on the right hand side of the Create button (it looks as if >> it's part of the button) - when I clicked directly on the word "Create" I >> got a form that made more sense and allowed me to choose the project. >> >> Regards, >> >>

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Ted Yu
h Bajaj > > On Tue, Mar 8, 2016 at 6:25 PM, Andy Davidson < > a...@santacruzintegration.com> wrote: > >> Hi Ted >> >> I believe by default cassandra listens on 9042 >> >> From: Ted Yu <yuzhih...@gmail.com> >> Date: Tuesday, March 8, 2016 a

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Ted Yu
Have you contacted spark-cassandra-connector related mailing list ? I wonder where the port 9042 came from. Cheers On Tue, Mar 8, 2016 at 6:02 PM, Andy Davidson wrote: > > I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a python > notebook that reads

Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread Ted Yu
That may miss the 15th minute of the hour (with non-trivial deviation), right ? On Tue, Mar 8, 2016 at 8:50 AM, ayan guha wrote: > Why not compare current time in every batch and it meets certain condition > emit the data? > On 9 Mar 2016 00:19, "Abhishek Anand"

Re: Quetions about Actor model of Computation.

2016-03-08 Thread Ted Yu
This seems related: the second paragraph under Implementation and theory https://en.wikipedia.org/wiki/Closure_(computer_programming) On Tue, Mar 8, 2016 at 4:49 AM, Minglei Zhang wrote: > hello, experts. > > I am a student. and recently, I read a paper about *Actor

Re: Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Ted Yu
Josh: SerializerInstance and SerializationStream would also become private[spark], right ? Thanks On Mon, Mar 7, 2016 at 6:57 PM, Josh Rosen wrote: > Does anyone implement Spark's serializer interface > (org.apache.spark.serializer.Serializer) in your own third-party

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread Ted Yu
in, Atlas, > Ranger, Apache Infrastructure. There doesn't seem to be an option for me to > raise an issue for Spark?! > > Regards, > > James > > > On 4 March 2016 at 14:03, James Hammerton <ja...@gluru.co> wrote: > >> Sure thing, I'll see if I can isolate th

Re: how to implements a distributed system ?

2016-03-06 Thread Ted Yu
w.r.t. akka, please see the following: [SPARK-7997][CORE] Remove Akka from Spark Core and Streaming There are various ways to design a distributed system. Can you outline what your program does ? Cheers On Sun, Mar 6, 2016 at 8:35 AM, Minglei Zhang wrote: > hello, experts

Re: Streaming UI tab misleading for window operations

2016-03-06 Thread Ted Yu
Have you taken a look at SPARK-12739 ? FYI On Sun, Mar 6, 2016 at 4:06 AM, Jatin Kumar < jku...@rocketfuelinc.com.invalid> wrote: > Hello all, > > Consider following two code blocks: > > val ssc = new StreamingContext(sparkConfig, Seconds(2)) > val stream = KafkaUtils.createDirectStream(...) >

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Ted Yu
Regards, > Gourav Sengupta > >> On Sun, Mar 6, 2016 at 11:48 AM, Ted Yu <yuzhih...@gmail.com> wrote: >> Gourav: >> For the 3rd paragraph, did you mean the job seemed to be idle for about 5 >> minutes ? >> >> Cheers >> >>> On Mar 6, 2016, at

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Ted Yu
Gourav: For the 3rd paragraph, did you mean the job seemed to be idle for about 5 minutes ? Cheers > On Mar 6, 2016, at 3:35 AM, Gourav Sengupta wrote: > > Hi, > > This is a solved problem, try using s3a instead and everything will be fine. > > Besides that you

Re: How can I pass a Data Frame from object to another class

2016-03-05 Thread Ted Yu
Looking at the methods you call on HiveContext, they seem to belong to SQLContext. For SQLContext, you can use the below method of SQLContext in FirstQuery to retrieve SQLContext: def getOrCreate(sparkContext: SparkContext): SQLContext = { FYI On Sat, Mar 5, 2016 at 3:37 PM, Mich Talebzadeh
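
A minimal sketch of the suggested call, assuming a SparkContext named sc is already in scope:

    import org.apache.spark.sql.SQLContext
    // returns the existing singleton SQLContext, creating one on first use
    val sqlContext = SQLContext.getOrCreate(sc)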

Re: Dynamic partitions reallocations with added worker nodes

2016-03-05 Thread Ted Yu
bq. I haven't added one more HDFS node to a hadoop cluster Does each of the three nodes colocate with HDFS data nodes ? The absence of a 4th data node might have something to do with the partition allocation. Can you show your code snippet ? Thanks On Sat, Mar 5, 2016 at 2:54 PM, Eugene Morozov

Re: Spark Streaming - Travis CI and GitHub custom receiver - continuous data but empty RDD?

2016-03-05 Thread Ted Yu
bq. reportError("Exception while streaming travis", e) I assume there was none of the above in your job. What Spark release are you using ? Thanks On Sat, Mar 5, 2016 at 4:57 AM, Dominik Safaric wrote: > Dear all, > > Lately, as a part of a scientific

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
a member of Seq[(String, Int)] >> [error] val b = a.toDF("Name","score").registerTempTable("tmp") >> [error] ^ >> [error] >> /home/hduser/dba/bin/scala/Sequence/src/main/scala/Sequence.scala:17: not >> found: value sql >> [error] sql("select

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
ain/scala/Sequence.scala:19: value > toDF is not a member of Seq[(String, Int)] > [error] a.toDF("Name","score").sort(desc("score")).show > [error] ^ > [error] three errors found > [error] (compile:compileIncremental) Compilation failed > [error

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
; toDF is not a member of Seq[(String, Int)] > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > >

Re: How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)‏

2016-03-04 Thread Ted Yu
bq. However the method does not seem inherited to HiveContext. Can you clarify the above observation ? HiveContext extends SQLContext . On Fri, Mar 4, 2016 at 1:23 PM, jelez wrote: > What is the best approach to use getOrCreate for streaming job with > HiveContext. > It
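
A rough sketch of the singleton pattern inside a streaming job (the stream name is hypothetical); since HiveContext extends SQLContext, the same getOrCreate call can serve both:

    stream.foreachRDD { rdd =>
      // reuse one context per JVM instead of constructing a new one per batch
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      // DataFrame / SQL work on rdd goes here
    }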

Re: Error building a self contained Spark app

2016-03-04 Thread Ted Yu
Can you add the following into your code ? import sqlContext.implicits._ On Fri, Mar 4, 2016 at 1:14 PM, Mich Talebzadeh wrote: > Hi, > > I have a simple Scala program as below > > import org.apache.spark.SparkContext > import org.apache.spark.SparkContext._ > import
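
For context, a small sketch of why the import matters (sample data hypothetical, column names from the thread):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._ // brings the toDF conversion for Seq/RDD of tuples into scope
    val a = Seq(("alice", 1), ("bob", 2))
    val df = a.toDF("Name", "score") // does not compile without the implicits import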

Re: Issue with sbt failing on amplab succinct

2016-03-04 Thread Ted Yu
assLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 11 more > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https

Re: Issue with sbt failing on amplab succinct

2016-03-04 Thread Ted Yu
Can you show the complete stack trace ? It was not clear which class's definition was not found. On Fri, Mar 4, 2016 at 6:46 AM, Mich Talebzadeh wrote: > Hi, > > I have a simple Scala code that I want to use it in an sbt project. > > It is pretty simple but imports

Re: Do we need schema for Parquet files with Spark?

2016-03-03 Thread Ted Yu
Have you taken a look at https://parquet.apache.org/community/ ? On Thu, Mar 3, 2016 at 7:32 PM, ashokkumar rajendran < ashokkumar.rajend...@gmail.com> wrote: > Hi, > > I am exploring to use Apache Parquet with Spark SQL in our project. I > notice that Apache Parquet uses different encoding for

Re: an OOM while persist as DISK_ONLY

2016-03-03 Thread Ted Yu
bq. that solved some problems Is there any problem that was not solved by the tweak ? Thanks On Thu, Mar 3, 2016 at 4:11 PM, Eugen Cepoi wrote: > You can limit the amount of memory spark will use for shuffle even in 1.6. > You can do that by tweaking the

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Ted Yu
bq. hConf.setBoolean("hbase.cluster.distributed", true) Not sure why the above is needed. If hbase-site.xml is on the classpath, it should contain the above setting already. FYI On Thu, Mar 3, 2016 at 6:08 AM, Ted Yu <yuzhih...@gmail.com> wrote: > From the log

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Ted Yu
From the log snippet you posted, it was not clear why the connection got lost. You can lower the value for caching and see if GC activity gets lower. How wide are the rows in the hbase table ? Thanks > On Mar 3, 2016, at 1:01 AM, Nirav Patel wrote: > > so why does

Re: Spark sql query taking long time

2016-03-02 Thread Ted Yu
Have you seen the thread 'Filter on a column having multiple values' where Michael gave this example ? https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/107522969592/2840265927289860/2388bac36e.html FYI On Wed, Mar 2, 2016 at

Re: Connect the two tables in spark sql

2016-03-01 Thread Ted Yu
You only showed one record from each table. Have you looked at the following method in DataFrame ? def unionAll(other: DataFrame): DataFrame = withPlan { On Tue, Mar 1, 2016 at 7:13 PM, Angel Angel wrote: > Hello Sir/Madam, > > I am using the spark sql for the data
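
A minimal sketch of the suggested call (DataFrame names hypothetical):

    // appends the rows of df2 to df1; the schemas must match positionally
    val combined = df1.unionAll(df2)
    combined.show()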

Re: Does anyone have spark code style guide xml file ?

2016-03-01 Thread Ted Yu
See this in source repo: ./.idea/projectCodeStyle.xml On Tue, Mar 1, 2016 at 6:55 PM, zml张明磊 wrote: > Hello, > > > > Appreciate if you have xml file with the following style code ? > > https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide > > > >

Re: Spark executor killed without apparent reason

2016-03-01 Thread Ted Yu
Using pastebin seems to be better. The attachment may not go through. FYI On Tue, Mar 1, 2016 at 6:07 PM, Jeff Zhang wrote: > Do you mind to attach the whole yarn app log ? > > On Wed, Mar 2, 2016 at 10:03 AM, Nirav Patel > wrote: > >> Hi Ryan, >> >>

Re: Get rid of FileAlreadyExistsError

2016-03-01 Thread Ted Yu
Do you mind pastebin'ning the stack trace with the error so that we know which part of the code is under discussion ? Thanks On Tue, Mar 1, 2016 at 7:48 AM, Peter Halliday wrote: > I have a Spark application that has a Task seem to fail, but it actually > did write out some

Re: local class incompatible: stream classdesc

2016-03-01 Thread Ted Yu
An RDD serialized by one release of Spark is not guaranteed to be readable by another release of Spark. Please check whether there are mixed Spark versions. FYI: http://stackoverflow.com/questions/10378855/java-io-invalidclassexception-local-class-incompatible On Tue, Mar 1, 2016 at 7:35 AM,

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-03-01 Thread Ted Yu
component being built with different release of hbase. Try setting "hbase.defaults.for.version.skip" to true. Cheers On Mon, Feb 29, 2016 at 9:12 PM, Ted Yu <yuzhih...@gmail.com> wrote: > 16/02/29 23:09:34 INFO ZooKeeper: Initiating client connection, > connec

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-02-29 Thread Ted Yu
inaccessible to your Spark job. Please add it in your classpath. On Mon, Feb 29, 2016 at 8:42 PM, Ted Yu <yuzhih...@gmail.com> wrote: > 16/02/29 23:09:34 INFO ClientCnxn: Opening socket connection to server > localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using >

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-02-29 Thread Ted Yu
16/02/29 23:09:34 INFO ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error) Is your cluster secure cluster ? bq. Trace : Was there any output after 'Trace :' ? Was hbase-site.xml accessible to your Spark job

Re: [Error]: Spark 1.5.2 + HiveHbase Integration

2016-02-29 Thread Ted Yu
Divya: Please try not to cross post your question. In your case HBase-common jar is needed. To find all the hbase jars needed, you can run 'mvn dependency:tree' and check its output. > On Feb 29, 2016, at 1:48 AM, Divya Gehlot wrote: > > Hi, > I am trying to access

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Ted Yu
http://www.amazon.com/Scala-Spark-Alexy-Khrabrov/dp/1491929286/ref=sr_1_1?ie=UTF8&qid=1456696284&sr=8-1&keywords=spark+dataframe There is another one from Wiley (to be published on March 21): "Spark: Big Data Cluster Computing in Production," written by Ilya Ganelin, Brennon York, Kai Sasaki, and Ema Orhian On

Re: Hbase in spark

2016-02-26 Thread Ted Yu
ase module only but the problem > is when I do the bulk load it shows data skew and takes time to create the > hfile. > On 26 Feb 2016 10:25 p.m., "Ted Yu" <yuzhih...@gmail.com> wrote: > >> In hbase, there is hbase-spark module which supports bulk load. >>

Re: Hbase in spark

2016-02-26 Thread Ted Yu
In hbase, there is hbase-spark module which supports bulk load. This module is to be backported in the upcoming 1.3.0 release. There is some pending work, such as HBASE-15271 . FYI On Fri, Feb 26, 2016 at 8:50 AM, Renu Yadav wrote: > Has anybody implemented bulk load into

Re: How to get progress information of an RDD operation

2016-02-23 Thread Ted Yu
I think Ningjun was looking for a programmatic way of tracking progress. I took a look at: ./core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala but there don't seem to be fine-grained events directly reflecting what Ningjun is looking for. On Tue, Feb 23, 2016 at 11:24 AM, Kevin
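
For illustration, a coarse-grained listener sketch using the standard Spark 1.x API — it surfaces task-level events, not the per-record progress asked about:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        // fires once per finished task; the closest built-in proxy for progress
        println(s"task ended in stage ${taskEnd.stageId}")
      }
    })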

Re: Percentile calculation in spark 1.6

2016-02-23 Thread Ted Yu
Please take a look at the following if you can utilize Hive UDFs: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUdfSuite.scala On Tue, Feb 23, 2016 at 6:28 AM, Chandeep Singh wrote: > This should help - >

Re: Read from kafka after application is restarted

2016-02-23 Thread Ted Yu
For receiver approach, have you tried Ryan's workaround ? Btw I don't see the errors you faced because there was no attachment. > On Feb 23, 2016, at 3:39 AM, vaibhavrtk1 wrote: > > Hello > > I have tried with Direct API but i am getting this an error, which is

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Ted Yu
Which line is line 42 in your code ? When variable lines becomes empty, you can stop your program. Cheers > On Feb 23, 2016, at 12:25 AM, Femi Anthony wrote: > > I am working on Spark Streaming API and I wish to stream a set of > pre-downloaded web log files continuously
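
A small sketch of stopping from within the job once the source is exhausted (the emptiness check and flag are application-specific assumptions):

    if (noMoreData) { // hypothetical condition, e.g. `lines` empty for several batches
      // let queued batches finish before shutting down
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }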

Re: spark 1.6 Not able to start spark

2016-02-22 Thread Ted Yu
Which Hadoop release did you build Spark against ? Can you give the full stack trace ? > On Feb 22, 2016, at 9:38 PM, Arunkumar Pillai wrote: > > Hi When i try to start spark-shell > I'm getting following error > > > Exception in thread "main"

Re: Using functional programming rather than SQL

2016-02-22 Thread Ted Yu
Mich: Please refer to the following test suite for examples on various DataFrame operations: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala On Mon, Feb 22, 2016 at 4:39 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > Thanks Dean. > > I gather if

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Ted Yu
} >>> else{ >>> return new Tuple2<>(key, null); >>> } >>> } >>> else{ >>>

Re: Spark Cache Eviction

2016-02-22 Thread Ted Yu
Please see SPARK-1762 Add functionality to pin RDDs in cache On Mon, Feb 22, 2016 at 6:43 AM, Pietro Gentile < pietro.gentile89.develo...@gmail.com> wrote: > Hi all, > > Is there a way to prevent eviction of the RDD from SparkContext ? > I would not use the cache with its default behavior

Re: 回复: a new FileFormat 5x~100x faster than parquet

2016-02-22 Thread Ted Yu
The referenced benchmark is in Chinese. Please provide English version so that more people can understand. For item 7, looks like the speed of ingest is much slower compared to using Parquet. Cheers On Mon, Feb 22, 2016 at 6:12 AM, 开心延年 wrote: > 1.ya100 is not only the

Re: Evaluating spark streaming use case

2016-02-21 Thread Ted Yu
Airflow workflow scheduler: > https://github.com/fluxcapacitor/pipeline/wiki > > my advice with spark streaming is to get the data out of spark streaming > as quickly as possible - and into a more durable format more suitable for > aggregation and compute. > > this greatly simplif

Re: Evaluating spark streaming use case

2016-02-21 Thread Ted Yu
w.r.t. cleaner TTL, please see: [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0 FYI On Sun, Feb 21, 2016 at 4:16 AM, Gerard Maas wrote: > > It sounds like another window operation on top of the 30-min window will > achieve the desired objective. > Just

Re: RDD[org.apache.spark.sql.Row] filter ERROR

2016-02-21 Thread Ted Yu
I tried the following in spark-shell: scala> val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num") df0: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields] scala> val idList = List("1", "2", "3") idList: List[String] = List(1, 2, 3) scala> val
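
The archived snippet is cut off; one plausible completion, assuming the goal was keeping rows whose column A value appears in idList:

    scala> val filtered = df0.filter($"A".isin(idList: _*))
    scala> filtered.show()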

Re: Element appear in both 2 splits of RDD after using randomSplit

2016-02-20 Thread Ted Yu
Have you looked at: SPARK-12662 Fix DataFrame.randomSplit to avoid creating overlapping splits Cheers On Sat, Feb 20, 2016 at 7:01 PM, tuan3w wrote: > I'm training a model using MLLib. When I try to split data into training > and > test data, I found a weird problem. I

Re: Checking for null values when mapping

2016-02-20 Thread Ted Yu
For #2, you can filter out row whose first column has length 0. Cheers > On Feb 20, 2016, at 6:59 AM, Mich Talebzadeh wrote: > > Thanks > > > So what I did was > > scala> val df = > sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", >
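
A minimal sketch of suggestion #2 (the column name C0 assumes spark-csv's default naming):

    import org.apache.spark.sql.functions.length
    // keep only rows whose first column is non-empty
    val cleaned = df.filter(length(df("C0")) > 0)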

Re: Communication between two spark streaming Job

2016-02-19 Thread Ted Yu
Have you considered using a Key Value store which is accessible to both jobs ? The communication would take place through this store. Cheers On Fri, Feb 19, 2016 at 11:48 AM, Ashish Soni wrote: > Hi , > > Is there any way we can communicate across two different spark

Re: Submitting Jobs Programmatically

2016-02-19 Thread Ted Yu
on windows. > > My default the start-all.sh doesn't work and I don't see anything in > localhos:8080 > > I will do some more investigation and come back. > > Thanks again for all your help! > > Thanks & regards > Arko > > > On Fri, Feb 19, 2016 at 6:35

Re: Submitting Jobs Programmatically

2016-02-19 Thread Ted Yu
Please see https://spark.apache.org/docs/latest/spark-standalone.html On Fri, Feb 19, 2016 at 6:27 PM, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Hi, > > Thanks for your response, that really helped. > > However, I don't believe the job is being submitted. When I run spark >

Re: How to get the code for class in spark

2016-02-19 Thread Ted Yu
Can you clarify your question ? Did you mean the body of your class ? > On Feb 19, 2016, at 4:43 AM, Ashok Kumar wrote: > > Hi, > > If I define a class in Scala like > > case class(col1: String, col2:Int,...) > > and it is created how would I be able to see its

Re: Concurreny does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Ted Yu
Is it possible to perform the tests using Spark 1.6.0 ? Thanks On Thu, Feb 18, 2016 at 9:51 PM, Prabhu Joseph wrote: > Hi All, > >When running concurrent Spark Jobs on YARN (Spark-1.5.2) which share a > single Spark Context, the jobs take more time to complete

Re: UDAF support for DataFrames in Spark 1.5.0?

2016-02-18 Thread Ted Yu
Richard: Please see SPARK-9664 Use sqlContext.udf to register UDAFs Cheers On Thu, Feb 18, 2016 at 3:18 PM, Kabeer Ahmed wrote: > I use Spark 1.5 with CDH5.5 distribution and I see that support is present > for UDAF. From the link: >

Re: Is this likely to cause any problems?

2016-02-18 Thread Ted Yu
Zeppelin notebook > if they do some port scanning... > > 2016-02-18 15:04 GMT+01:00 Gourav Sengupta <gourav.sengu...@gmail.com>: > >> Hi, >> >> Just out of sheer curiosity why are you not using EMR to start your SPARK >> cluster? >> >> >> Regard

Re: adding a split and union to a streaming application cause big performance hit

2016-02-18 Thread Ted Yu
bq. streamingContext.remember("duration") did not help Can you give a bit more detail on the above ? Did you mean the job encountered OOME later on ? Which Spark release are you using ? Cheers On Wed, Feb 17, 2016 at 6:03 PM, ramach1776 wrote: > We have a streaming

Re: Is this likely to cause any problems?

2016-02-18 Thread Ted Yu
Have you seen this ? HADOOP-10988 Cheers On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton wrote: > HI, > > I am seeing warnings like this in the logs when I run Spark jobs: > > OpenJDK 64-Bit Server VM warning: You have loaded library >

Re: SparkOnHBase : Which version of Spark its available

2016-02-17 Thread Ted Yu
Any idea as to when this will be released? > > Thanks, > Ben > > > On Feb 17, 2016, at 2:53 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > The HBASE JIRA below is for HBase 2.0 > > HBase Spark module would be back ported to hbase 1.3.0 > > FYI > > On Feb 17, 2016,

Re: SparkOnHBase : Which version of Spark its available

2016-02-17 Thread Ted Yu
The HBASE JIRA below is for HBase 2.0 HBase Spark module would be back ported to hbase 1.3.0 FYI > On Feb 17, 2016, at 1:13 PM, Chandeep Singh wrote: > > HBase-Spark module was added in 1.3 > > https://issues.apache.org/jira/browse/HBASE-13992 > >

Re: Running multiple foreach loops

2016-02-17 Thread Ted Yu
If the accumulators are all updated over the same data, calling foreach() once and updating all three in that single pass seems to give better performance than three separate loops. > On Feb 17, 2016, at 4:30 PM, Daniel Imberman > wrote: > > Hi all, > > So I'm currently figuring out how to accumulate three separate accumulators: > > val
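
A minimal sketch of the single-pass version (data and predicates hypothetical):

    val acc1 = sc.accumulator(0)
    val acc2 = sc.accumulator(0)
    val acc3 = sc.accumulator(0)
    // one traversal of the data updates all three accumulators
    rdd.foreach { x =>
      acc1 += 1
      if (x > 0) acc2 += 1 else acc3 += 1
    }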

Re: Side effects of using var inside a class object in a Rdd

2016-02-16 Thread Ted Yu
on){ >obj.g = calculateE(e,f) > } > obj > ) > > > So I created 1 class with all variables, and then trying to update fields > of the same class. > > On Tue, Feb 16, 2016 at 11:38 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Age can be computed fr

Re: Unusually large deserialisation time

2016-02-16 Thread Ted Yu
Darren: Can you post link to the deadlock issue you mentioned ? Thanks > On Feb 16, 2016, at 6:55 AM, Darren Govoni wrote: > > I think this is part of the bigger issue of serious deadlock conditions > occurring in spark many of us have posted on. > > Would the task in

Re: Spark DataFrameNaFunctions unrecognized

2016-02-15 Thread Ted Yu
to fix the issue > > Regards, > Satish Chandra > > > > > > On Mon, Feb 15, 2016 at 7:41 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> fill() was introduced in 1.3.1 >> >> Can you show code snippet which reproduces the error ? >&

Re: Side effects of using var inside a class object in a Rdd

2016-02-15 Thread Ted Yu
Age can be computed from the birthdate. Looks like it doesn't need to be a member of Animal class. If age is just for illustration, can you give an example which better mimics the scenario you work on ? Cheers On Mon, Feb 15, 2016 at 8:53 PM, Hemalatha A < hemalatha.amru...@googlemail.com>

Re: IllegalArgumentException UnsatisfiedLinkError snappy-1.1.2 spark-shell error

2016-02-15 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtW43zT1e2nfb=Re+ibsnappyjava+so+failed+to+map+segment+from+shared+object On Mon, Feb 15, 2016 at 7:09 PM, Paolo Villaflores wrote: > > Hi, > > > > I am trying to run spark 1.6.0. > > I have previously just

Re: which is better RDD or Dataframe?

2016-02-15 Thread Ted Yu
Can you describe the types of query you want to perform ? If you don't already have a data flow which is optimized for RDD, I would suggest using the DataFrame API (or even the Dataset API), which gives the optimizer more room. Cheers On Mon, Feb 15, 2016 at 6:43 PM, Divya Gehlot

Re: How to run Scala file examples in spark 1.5.2

2016-02-15 Thread Ted Yu
cala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > thanks > > > > > On Tuesday, 16 February 2016, 1:33, Ted Yu <yuzhih...@gmail.com> wrote: > > > Here is the path to the examples jar in 1.6.0 release: > > ./lib/spark-examples-1.6.0-hado

Re: How to run Scala file examples in spark 1.5.2

2016-02-15 Thread Ted Yu
If you don't modify HdfsTest.scala, there is no need to rebuild it - it is contained in the examples jar coming with Spark release. You can use spark-submit to run the example. Cheers On Mon, Feb 15, 2016 at 5:24 PM, Ashok Kumar wrote: > Gurus, > > I am trying to

Re: Passing multiple jar files to spark-shell

2016-02-15 Thread Ted Yu
Mich: You can pass jars for the driver using: spark.driver.extraClassPath Cheers On Mon, Feb 15, 2016 at 1:05 AM, Mich Talebzadeh wrote: > Thanks Deng. Unfortunately it seems that it looks for driver-class-path as > well > > > > For example with –jars alone I get > > > >
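
For illustration, both settings passed together (jar paths hypothetical); note --jars is comma-separated while extraClassPath uses the platform path separator:

    spark-shell --jars /opt/libs/a.jar,/opt/libs/b.jar \
      --conf spark.driver.extraClassPath=/opt/libs/a.jar:/opt/libs/b.jar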

Re: Spark DataFrameNaFunctions unrecognized

2016-02-15 Thread Ted Yu
fill() was introduced in 1.3.1 Can you show code snippet which reproduces the error ? I tried the following using spark-shell on master branch: scala> df.na.fill(0) res0: org.apache.spark.sql.DataFrame = [col: int] Cheers On Mon, Feb 15, 2016 at 3:36 AM, satish chandra j

Re: How to join an RDD with a hive table?

2016-02-15 Thread Ted Yu
Have you tried creating a DataFrame from the RDD and join with DataFrame which corresponds to the hive table ? On Sun, Feb 14, 2016 at 9:53 PM, SRK wrote: > Hi, > > How to join an RDD with a hive table and retrieve only the records that I > am > interested. Suppose, I
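
A rough sketch of the suggested approach (table and column names hypothetical; reading the Hive table requires a HiveContext):

    import sqlContext.implicits._
    val rddDF = rdd.toDF("id", "value")            // DataFrame built from the RDD
    val hiveDF = sqlContext.table("my_hive_table") // DataFrame over the Hive table
    // join, then select only the records of interest
    val joined = rddDF.join(hiveDF, rddDF("id") === hiveDF("id"))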

Re: Best way to bring up Spark with Cassandra (and Elasticsearch) in production.

2016-02-15 Thread Ted Yu
Sounds reasonable. Please consider posting question on Spark C* connector on their mailing list if you have any. On Sun, Feb 14, 2016 at 7:51 PM, Kevin Burton wrote: > Afternoon. > > About 6 months ago I tried (and failed) to get Spark and Cassandra working > together in

Re: Unable to insert overwrite table with Spark 1.5.2

2016-02-15 Thread Ted Yu
Do you mind trying Spark 1.6.0 ? As far as I can tell, 'Cannot overwrite table' exception may only occur for CreateTableUsingAsSelect when source and dest relations refer to the same table in branch-1.6 Cheers On Sun, Feb 14, 2016 at 9:29 PM, Ramanathan R wrote: > Hi

Re: Scala types to StructType

2016-02-15 Thread Ted Yu
an wrote: > > Right, Thanks Ted. > > On Fri, Feb 12, 2016 at 10:21 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Minor correction: the class is CatalystTypeConverters.scala >> >> On Thu, Feb 11, 2016 at 8:46 PM, Yogesh Mahajan < >> <ymaha...@snap

Re: new to Spark - trying to get a basic example to run - could use some help

2016-02-13 Thread Ted Yu
Maybe a comment should be added to SparkPi.scala, telling user to look for the value in stdout log ? Cheers On Sat, Feb 13, 2016 at 3:12 AM, Chandeep Singh wrote: > Try looking at stdout logs. I ran the exactly same job as you and did not > see anything on the console

Re: Unrecognized VM option 'MaxPermSize=512M'

2016-02-13 Thread Ted Yu
I have the following for my shell: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" How do you specify MAVEN_OPTS ? Which version of Java / maven do you use ? Cheers On Sat, Feb 13, 2016 at 7:34 AM, Milad khajavi wrote: > Hello, > When I want

Re: using udf to convert Oracle number column in Data Frame

2016-02-13 Thread Ted Yu
Please take a look at sql/core/src/main/scala/org/apache/spark/sql/functions.scala : def udf(f: AnyRef, dataType: DataType): UserDefinedFunction = { UserDefinedFunction(f, dataType, None) And sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala : test("udf") { val foo =

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-13 Thread Ted Yu
mapWithState supports checkpoint. There has been some bug fix since release of 1.6.0 e.g. SPARK-12591 NullPointerException using checkpointed mapWithState with KryoSerializer which is in the upcoming 1.6.1 Cheers On Sat, Feb 13, 2016 at 12:05 PM, Abhishek Anand

Re: off-heap certain operations

2016-02-12 Thread Ted Yu
Ovidiu-Cristian: Please see the following JIRA / PR : [SPARK-12251] Document and improve off-heap memory configurations Cheers On Thu, Feb 11, 2016 at 11:06 PM, Sea <261810...@qq.com> wrote: > spark.memory.offHeap.enabled (default is false) , it is wrong in spark > docs. Spark1.6 do not

Re: Inserting column to DataFrame

2016-02-12 Thread Ted Yu
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) > > Regards, > > Zsolt > > > 2016-02-12 13:11 GMT+01:00 Ted Yu <yuzhih...@gmail.com>: > >>

Re: Python3 does not have Module 'UserString'

2016-02-12 Thread Ted Yu
>> pairs = lines.map(lambda x: (x, 1)) >> counts = pairs.reduceByKey(lambda a, b: a + b) >> counts.collect() >> ``` >> >> On Fri, Feb 12, 2016 at 4:26 PM, Ted Yu <yuzhih...@gmail.com> wrote: >> >>> Can you give a bit more information ? >

Re: Python3 does not have Module 'UserString'

2016-02-12 Thread Ted Yu
Can you give a bit more information ? Release of Spark you use, full error trace, your code snippet. Thanks On Fri, Feb 12, 2016 at 7:22 AM, Sisyphuss wrote: > When trying the `reduceByKey` transformation on Python3.4, I got the > following error: > > ImportError: No

Re: Spark Submit

2016-02-12 Thread Ted Yu
Have you tried specifying multiple '--conf key=value' ? Cheers On Fri, Feb 12, 2016 at 7:44 AM, Ashish Soni wrote: > Hi All , > > How do i pass multiple configuration parameter while spark submit > > Please help i am trying as below > > spark-submit --conf
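
For illustration, --conf repeated once per key=value pair (values and class name hypothetical):

    spark-submit \
      --conf spark.executor.memory=4g \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --class com.example.Main app.jar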

Re: spark shell ini file

2016-02-11 Thread Ted Yu
Please see: [SPARK-13086][SHELL] Use the Scala REPL settings, to enable things like `-i file` On Thu, Feb 11, 2016 at 1:45 AM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > Hi, > > > > in Hive one can use -I parameter to preload certain setting into the > beeline

Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
I think SPARK_CLASSPATH is deprecated. Can you show the command line launching your Spark job ? Which Spark release do you use ? Thanks On Thu, Feb 11, 2016 at 5:38 PM, Charlie Wright wrote: > built and installed hadoop with: > mvn package -Pdist -DskipTests -Dtar >

Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
The Spark driver does not run on the YARN cluster in client mode, only the Spark executors do. Can you check YARN logs for the failed job to see if there was more clue ? Does the YARN cluster run the customized hadoop or stock hadoop ? Cheers On Thu, Feb 11, 2016 at 5:44 PM, Charlie Wright

Re: Scala types to StructType

2016-02-11 Thread Ted Yu
Minor correction: the class is CatalystTypeConverters.scala On Thu, Feb 11, 2016 at 8:46 PM, Yogesh Mahajan wrote: > CatatlystTypeConverters.scala has all types of utility methods to convert > from Scala to row and vice a versa. > > > On Fri, Feb 12, 2016 at 12:21 AM,

Re: spark thrift server transport protocol

2016-02-11 Thread Ted Yu
From the head of HiveThriftServer2 : * The main entry point for the Spark SQL port of HiveServer2. Starts up a `SparkSQLContext` and a * `HiveThriftServer2` thrift server. Looking at HiveServer2.java from Hive, looks like it uses thrift protocol. FYI On Thu, Feb 11, 2016 at 9:34 AM,

Re: Dataframes

2016-02-11 Thread Ted Yu
bq. Whether sContext(SQlCOntext) will help to query in both the dataframes and will it decide on which dataframe to query for . Can you clarify what you were asking ? The queries would be carried out on respective DataFrame's as shown in your snippet. On Thu, Feb 11, 2016 at 8:47 AM, Gaurav

Re: Dataset joinWith condition

2016-02-10 Thread Ted Yu
" and it gives an error message that a TypedColumn > is expected. > > Regards, > Raghava. > > > On Tue, Feb 9, 2016 at 10:12 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Please take a look at: >> sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala >

Re: RDD distribution

2016-02-10 Thread Ted Yu
What Partitioner do you use ? Have you tried using RangePartitioner ? Cheers On Wed, Feb 10, 2016 at 3:54 PM, daze5112 wrote: > Hi im trying to improve the performance of some code im running but have > noticed that my distribution of my RDD across executors isn't

Re: Spark Job on YARN accessing Hbase Table

2016-02-10 Thread Ted Yu
Have you tried adding hbase client jars to spark.executor.extraClassPath ? Cheers On Wed, Feb 10, 2016 at 12:17 AM, Prabhu Joseph wrote: > + Spark-Dev > > For a Spark job on YARN accessing hbase table, added all hbase client jars > into spark.yarn.dist.files,
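
A sketch of the suggestion (jar paths hypothetical; the exact set of HBase client jars depends on the HBase release):

    spark-submit \
      --conf spark.executor.extraClassPath=/opt/hbase/lib/hbase-client.jar:/opt/hbase/lib/hbase-common.jar:/opt/hbase/lib/hbase-protocol.jar \
      --class com.example.HBaseJob app.jar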
