Re: Extending Spark REST API

2016-03-24 Thread Ted Yu
bq. getServletHandlers is not intended for public use From MetricsSystem.scala : private[spark] class MetricsSystem private ( Looks like there is no easy way to extend REST API. On Thu, Mar 24, 2016 at 1:09 PM, Sebastian Kochman < sebastian.koch...@outlook.com> wrote: > Hello, > I have a ques

Re: What's the benefit of RDD checkpoint against RDD save

2016-03-24 Thread Ted Yu
checkpointing instead of saving still wouldn't > execute any action on the RDD -- it would just mark the point at which > checkpointing should be done when an action is eventually run. > > On Wed, Mar 23, 2016 at 7:38 PM, Ted Yu wrote: > >> bq. when I get the last RDD >

Re: Best way to determine # of workers

2016-03-25 Thread Ted Yu
Here is the doc for defaultParallelism : /** Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD). */ def defaultParallelism: Int = { What if the user changes parallelism ? Cheers On Fri, Mar 25, 2016 at 5:33 AM, manasdebashiskar wrote: > There is a sc

Re: This simple UDF is not working!

2016-03-25 Thread Ted Yu
Looks like you forgot an import for Date. FYI On Fri, Mar 25, 2016 at 7:36 AM, Mich Talebzadeh wrote: > > > Hi, > > writing a UDF to convert a string into Date > > def ChangeDate(word : String) : Date = { > | return > TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(word),"dd/MM/yyyy"),"yyyy-MM-dd")

Re: This simple UDF is not working!

2016-03-25 Thread Ted Yu
Do you mind showing body of TO_DATE() ? Thanks On Fri, Mar 25, 2016 at 7:38 AM, Ted Yu wrote: > Looks like you forgot an import for Date. > > FYI > > On Fri, Mar 25, 2016 at 7:36 AM, Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> >> >>

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Ted Yu
This is the original subject of the JIRA: Partition discovery fail if there is a _SUCCESS file in the table's root dir If I remember correctly, there were discussions on how (traditional) partition discovery slowed down Spark jobs. Cheers On Fri, Mar 25, 2016 at 10:15 AM, suresk wrote: > In pr

Re: This simple UDF is not working!

2016-03-25 Thread Ted Yu
d6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 25 March 2016 at 14:54, Ted Yu wrote: > >> Do you mind showing body of TO_DATE() ? >> &g

Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtAvwgE7dEI02 On Fri, Mar 25, 2016 at 10:39 AM, prateek arora wrote: > Hi > > I want to submit spark application from outside of spark clusters . so > please help me to provide a information regarding this. > > Regards > Prateek > > > > > -- > Vi

Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread Ted Yu
> I have one more question .. if i want to launch a spark application in > production environment so is there any other way so multiple users can > submit their job without having hadoop configuration . > > Regards > Prateek > > > On Fri, Mar 25, 2016 at 10:50 AM, Ted Yu

Re: Finding out the time a table was created

2016-03-25 Thread Ted Yu
Which release of Spark do you use, Mich ? In master branch, the message is more accurate (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala): override def getMessage: String = s"Table $table not found in database $db" On Fri, Mar 25, 2016 at 3:21 PM,

Re: Finding out the time a table was created

2016-03-25 Thread Ted Yu
able > info OK > > HTH > > > On Friday, 25 March 2016, 22:32, Ted Yu wrote: > > > Which release of Spark do you use, Mich ? > > In master branch, the message is more accurate > (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException

Re: Finding out the time a table was created

2016-03-25 Thread Ted Yu
/www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 25 March 2016 at 22:40, Ted Yu wrote: > >> Looks like database support was fixed by: >> >> [SPARK-7943] [SPARK-8105] [SPAR

Re: Hive table created by Spark seems to end up in default

2016-03-25 Thread Ted Yu
Session management has improved in 1.6.x (see SPARK-10810) Mind giving 1.6.1 a try ? Thanks On Fri, Mar 25, 2016 at 3:48 PM, Mich Talebzadeh wrote: > I have noticed that the only sure way to specify a Hive table from Spark > is to prefix it with database (DBName) name otherwise it seems to be

Re: Is this expected in Spark 1.6.1, derby.log file created when spark shell starts

2016-03-26 Thread Ted Yu
Same with master branch. I found derby.log in the following two files: .gitignore:derby.log dev/.rat-excludes:derby.log FYI On Sat, Mar 26, 2016 at 4:09 AM, Mich Talebzadeh wrote: > Having moved to Spark 1.6.1, I have noticed that whenever I start a > spark-sql or shell, a derby.log file is

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-26 Thread Ted Yu
park%20and%20Scala.pdf > > > -- Forwarded message -- > From: Ted Yu > Date: 26 March 2016 at 12:51 > Subject: Re: Any plans to migrate Transformer API to Spark SQL (closer to > DataFrames)? > To: Michał Zieliński > > > Michal: > Can you share the sli

Re: whether a certain piece can be assigned to a specicified node by some codes in my program.

2016-03-26 Thread Ted Yu
Please take a look at the following method: /** * Get the preferred locations of a partition, taking into account whether the * RDD is checkpointed. */ final def preferredLocations(split: Partition): Seq[String] = { checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {

Re: Hive on Spark engine

2016-03-26 Thread Ted Yu
According to: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_HDP_RelNotes/bk_HDP_RelNotes-20151221.pdf Spark 1.5.2 comes out of the box. Suggest moving questions on HDP to the Hortonworks forum. Cheers On Sat, Mar 26, 2016 at 3:32 PM, Mich Talebzadeh wrote: > Thanks Jorn. > > Just to be

Re: whether a certain piece can be assigned to a specicified node by some codes in my program.

2016-03-27 Thread Ted Yu
Please take a look at the MyRDD class in: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala There is scaladoc for the class. See how getPreferredLocations() is implemented. Cheers On Sun, Mar 27, 2016 at 2:01 AM, chenyong wrote: > Thank you Ted for your reply. > > Your ex
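A minimal sketch of that pattern, assuming a custom RDD (the class and host list below are hypothetical, not the MyRDD from DAGSchedulerSuite):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type that carries the host it should run on
class PinnedPartition(val index: Int, val host: String) extends Partition

class PinnedRDD(sc: SparkContext, hosts: Seq[String]) extends RDD[Int](sc, Nil) {
  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => new PinnedPartition(i, h): Partition }.toArray
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
  // The scheduler consults this when placing each partition's task
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[PinnedPartition].host)
}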

Re: Custom RDD in spark, cannot find custom method

2016-03-27 Thread Ted Yu
Can you show the full stack trace (or top 10 lines) and the snippet using your MyRDD ? Thanks On Sun, Mar 27, 2016 at 9:22 AM, Tenghuan He wrote: > Hi everyone, > > I am creating a custom RDD which extends RDD and adds a custom method, > however the custom method cannot be found. > The

Re: Custom RDD in spark, cannot find custom method

2016-03-27 Thread Ted Yu
ror: value customMethod is not a member of > org.apache.spark.rdd.RDD[(Int, String)]* > > and the customable method in PairRDDFunctions.scala is > > def customable(partitioner: Partitioner): RDD[(K, V)] = self.withScope { > new MyRDD[K, V](self, partitioner) > } > >

Re: Custom RDD in spark, cannot find custom method

2016-03-27 Thread Ted Yu
d.RDD[(Int, String)] = MyRDD[3]* at > customable at > 5 :28 > 6 scala> *myrdd.customMethod(bulk)* > *7 error: value customMethod is not a member of > org.apache.spark.rdd.RDD[(Int, String)]* > > On Mon, Mar 28, 2016 at 12:50 AM, Ted Yu wrote: > >> bq. def cus

Re: Custom RDD in spark, cannot find custom method

2016-03-28 Thread Ted Yu
oject then the custom method can be called in the main function and it >> works. >> I misunderstand the usage of custom rdd, the custom rdd does not have to be >> written to the spark project like UnionRDD, CogroupedRDD, and just add it to >> your own project. >>

Re: Aggregate subsequent x row values together.

2016-03-28 Thread Ted Yu
Can you describe your use case a bit more ? Since the row keys are not sorted in your example, there is a chance that you get indeterministic results when you aggregate on groups of two successive rows. Thanks On Mon, Mar 28, 2016 at 9:21 AM, sujeet jog wrote: > Hi, > > I have a RDD like this

Re: org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray error in newly built HBase

2016-03-28 Thread Ted Yu
Dropping dev@ Can you provide a bit more information ? release of hbase release of hadoop I assume you're running on Linux. Any change in Linux setup before the exception showed up ? On Mon, Mar 28, 2016 at 10:30 AM, beeshma r wrote: > Hi > i am testing with newly built HBase . Initially tab

Re: println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread Ted Yu
Can you describe what gets triggered by triggerAndWait ? Cheers On Mon, Mar 28, 2016 at 1:39 PM, kpeng1 wrote: > Hi All, > > I am currently trying to debug a spark application written in scala. I > have > a main method: > def main(args: Array[String]) { > ... > SocialUtil.trigge

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Ted Yu
See this method: lazy val rdd: RDD[T] = { On Mon, Mar 28, 2016 at 6:30 PM, Russell Jurney wrote: > Ok, I'm also unable to save to Elasticsearch using a dataframe's RDD. This > seems related to DataFrames. Is there a way to convert a DataFrame's RDD to > a 'normal' RDD? > > > On Mon, Mar 28, 2

Re: How to reduce the Executor Computing Time.

2016-03-29 Thread Ted Yu
Can you disclose a snippet of your code ? Which Spark release do you use ? Thanks > On Mar 29, 2016, at 3:42 AM, Charan Adabala wrote: > > From the below image how can we reduce the computing time for the stages, at > some stages the Executor Computing Time is less than 1 sec and some are > cons

Re: Unable to execute query on SAPHANA using SPARK

2016-03-29 Thread Ted Yu
As the error said, com.sap.db.jdbc.topology.Host is not serializable. Maybe post question on Sap Hana mailing list (if any) ? On Tue, Mar 29, 2016 at 7:54 AM, reena upadhyay < reena.upadh...@impetus.co.in> wrote: > I am trying to execute query using spark sql on SAP HANA from spark > shell. I >

Re: Unable to set cores while submitting Spark job

2016-03-30 Thread Ted Yu
-c CORES, --cores CORES Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker bq. sc.getConf().set() I think you should use this pattern (shown in https://spark.apache.org/docs/latest/spark-standalone.html): val conf = new SparkConf()
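A minimal sketch of that pattern (the property keys are standard; the values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Set core limits before the SparkContext is created; mutating
// sc.getConf() afterwards does not reconfigure a running context.
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.cores.max", "4")       // total cores the app may use (standalone mode)
  .set("spark.executor.cores", "2")  // cores per executor
val sc = new SparkContext(conf)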

Re: spark 1.5.2 - value filterByRange is not a member of org.apache.spark.rdd.RDD[(myKey, myData)]

2016-03-30 Thread Ted Yu
Have you tried the following construct ? new OrderedRDDFunctions[K, V, (K, V)](rdd).sortByKey() See core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala On Wed, Mar 30, 2016 at 5:20 AM, Nirav Patel wrote: > Hi, I am trying to use filterByRange feature of spark OrderedRDDFunctions >
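Spelled out a bit, a sketch using the key/value types from the subject (an implicit Ordering for the custom key is usually the missing piece):

import org.apache.spark.rdd.OrderedRDDFunctions

// filterByRange requires an Ordering[myKey] in implicit scope
implicit val myKeyOrdering: Ordering[myKey] = Ordering.by(_.someField) // someField is hypothetical

val inRange = new OrderedRDDFunctions[myKey, myData, (myKey, myData)](rdd)
  .filterByRange(lowerKey, upperKey) // bounds are illustrative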

Re: Loading multiple packages while starting spark-shell

2016-03-30 Thread Ted Yu
How did you specify the packages ? See the following from https://spark.apache.org/docs/latest/submitting-applications.html : Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates with --packages. On Wed, Mar 30, 2016 at 7:15 AM, Mustafa Elbehery

Re: Unable to Run Spark Streaming Job in Hadoop YARN mode

2016-03-31 Thread Ted Yu
Looking through https://spark.apache.org/docs/latest/configuration.html#spark-streaming , I don't see config specific to YARN. Can you pastebin the exception you saw ? When the job stopped, was there any error ? Thanks On Wed, Mar 30, 2016 at 10:57 PM, Soni spark wrote: > Hi All, > > I am una

Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-03-31 Thread Ted Yu
I tried this: scala> final case class Text(id: Int, text: String) warning: there was one unchecked warning; re-run with -unchecked for details defined class Text scala> val ds = Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text] ds: org.apache.spark.sql.Dataset[Text] = [id: int, text: string]

Re: Problem with jackson lib running on spark

2016-03-31 Thread Ted Yu
Spark 1.6.1 uses this version of jackson: 2.4.4 Looks like Tranquility uses different version of jackson. How do you build your jar ? Consider using maven-shade-plugin to resolve the conflict if you use maven. Cheers On Thu, Mar 31, 2016 at 9:50 AM, Marcelo Oikawa wrote: > Hi, list. > >

Re: Problem with jackson lib running on spark

2016-03-31 Thread Ted Yu
Please exclude jackson-databind - that was where the AnnotationMap class comes from. On Thu, Mar 31, 2016 at 11:37 AM, Marcelo Oikawa < marcelo.oik...@webradar.com> wrote: > Hi, Alonso. > > As you can see jackson-core is provided by several libraries, try to >> exclude it from spark-core, i think
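If the jar is built with sbt rather than maven, a roughly equivalent exclusion looks like this (coordinates and version are illustrative, not verified):

// build.sbt: keep Spark's jackson-databind by dropping the one
// Tranquility pulls in transitively
libraryDependencies += ("io.druid" %% "tranquility-core" % "0.7.4")
  .exclude("com.fasterxml.jackson.core", "jackson-databind")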

Re: Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Ted Yu
Can you show the stack trace ? The log message came from DiskBlockObjectWriter#revertPartialWritesAndClose(). Unfortunately, the method doesn't throw exception, making it a bit hard for caller to know of the disk full condition. On Thu, Mar 31, 2016 at 11:32 AM, Abhishek Anand wrote: > > Hi, >

Re: [SQL] A bug with withColumn?

2016-03-31 Thread Ted Yu
Looks like this is the result of the following check: val shouldReplace = output.exists(f => resolver(f.name, colName)) if (shouldReplace) { where the existing column, text, was replaced. On Thu, Mar 31, 2016 at 12:08 PM, Jacek Laskowski wrote: > Hi, > > Just ran into the following. Is this a

Re: Thread-safety of a SparkListener

2016-04-01 Thread Ted Yu
In general, you should implement thread-safety in your code. Which set of events are you interested in ? Cheers On Fri, Apr 1, 2016 at 9:23 AM, Truong Duc Kien wrote: > Hi, > > I need to gather some metrics using a SparkListener. Does the callback > methods need to thread-safe or they are alwa
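A minimal sketch of the thread-safe pattern; callbacks arrive on Spark's listener-bus thread while your own code reads the state:

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskTimeListener extends SparkListener {
  val totalRunTimeMs = new AtomicLong(0L)
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null for failed tasks
    Option(taskEnd.taskMetrics).foreach(m => totalRunTimeMs.addAndGet(m.executorRunTime))
  }
}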

Re: Where to set properties for the retainedJobs/Stages?

2016-04-01 Thread Ted Yu
You can set them in spark-defaults.conf See also https://spark.apache.org/docs/latest/configuration.html#spark-ui On Fri, Apr 1, 2016 at 8:26 AM, Max Schmidt wrote: > Can somebody tell me the interaction between the properties: > > spark.ui.retainedJobs > spark.ui.retainedStages > spark.history
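For reference, a sketch of the relevant entries (values are illustrative):

# conf/spark-defaults.conf
spark.ui.retainedJobs                500
spark.ui.retainedStages              500
spark.history.retainedApplications   50

The same keys can also be set programmatically on a SparkConf before the context is created, which covers jobs launched through the Java API instead of spark-submit.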

Re: OutOfMemory with wide (289 column) dataframe

2016-04-01 Thread Ted Yu
bq. This was a big help! The email (maybe only addressed to you) didn't come with your latest reply. Do you mind sharing it ? Thanks On Fri, Apr 1, 2016 at 11:37 AM, ludflu wrote: > This was a big help! For the benefit of my fellow travelers running spark > on > EMR: > > I made a json file wi

Re: Where to set properties for the retainedJobs/Stages?

2016-04-01 Thread Ted Yu
hem for the history-server? The daemon? The workers? > > And what if I use the java API instead of spark-submit for the jobs? > > I guess that the spark-defaults.conf are obsolete for the java API? > > > Am 2016-04-01 18:58, schrieb Ted Yu: > >> You can set them in

Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Ted Yu
Assuming your code is written in Scala, I would suggest using ScalaTest. Please take a look at the XXSuite.scala files under mllib/ On Fri, Apr 1, 2016 at 1:31 PM, Shishir Anshuman wrote: > Hello, > > I have a code written in scala using Mllib. I want to perform unit testing > it. I cant decide
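A minimal sketch in that style (suite name and assertion are illustrative):

import org.scalatest.FunSuite
import org.apache.spark.{SparkConf, SparkContext}

class WordCountSuite extends FunSuite {
  test("reduceByKey sums counts") {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
    try {
      val counts = sc.parallelize(Seq("a", "b", "a"))
        .map((_, 1)).reduceByKey(_ + _).collectAsMap()
      assert(counts("a") === 2)
    } finally {
      sc.stop() // always stop the context so suites don't leak contexts
    }
  }
}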

Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Ted Yu
quot;1.6.0", "org.apache.spark" % "spark-mllib_2.10" % "1.6.0" )* > > > > > On Sat, Apr 2, 2016 at 2:21 AM, Ted Yu wrote: > >> Assuming your code is written in Scala, I would suggest using ScalaTest. >> >> Please take a look at t

Re: Problem with jackson lib running on spark

2016-04-01 Thread Ted Yu
Thanks for sharing the workaround. Probably send a PR on the Tranquility GitHub :-) On Fri, Apr 1, 2016 at 12:50 PM, Marcelo Oikawa wrote: > Hi, list. > > Just to close the thread. Unfortunately, I didn't solve the jackson lib > problem but I did a workaround that works fine for me. Perhaps this he

Re: Scala: Perform Unit Testing in spark

2016-04-02 Thread Ted Yu
; When I added *"org.apache.spark" % "spark-core_2.10" % "1.6.0", *it > should include spark-core_2.10-1.6.1-tests.jar. > Why do I need to use the jar file explicitly? > > And how do I use the jars for compiling with *sbt* and running the tests > on

Re: Multiple lookups; consolidate result and run further aggregations

2016-04-02 Thread Ted Yu
Looking at the implementation for lookup in PairRDDFunctions, I think your understanding is correct. On Sat, Apr 2, 2016 at 3:16 AM, Nirav Patel wrote: > I will start by question: Is spark lookup function on pair rdd is a driver > action. ie result is returned to driver? > > I have list of Keys

Re: multiple splits fails

2016-04-03 Thread Ted Yu
bq. split"\t," splits the filter by carriage return Minor correction: "\t" denotes tab character. On Sun, Apr 3, 2016 at 7:24 AM, Eliran Bivas wrote: > Hi Mich, > > 1. The first underscore in your filter call is refering to a line in the > file (as textFile() results in a collection of strings)

Re: multiple splits fails

2016-04-03 Thread Ted Yu
showlines = messages.filter(_ contains("ASE 15")).filter(_ > contains("UPDATE INDEX STATISTICS")).flatMap(line => > line.split("\n,")).map(word => (word, 1)).reduceByKey(_ + > _).collect.foreach(println) > > > How does one refer to the conten
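A cleaned-up sketch of the pipeline under discussion (assumes tab-separated fields; each RDD element is already a single line, so splitting on "\n" is a no-op):

val counts = messages
  .filter(line => line.contains("ASE 15") && line.contains("UPDATE INDEX STATISTICS"))
  .flatMap(_.split("\t"))        // split each line on tabs, not newlines
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println)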

Re: multiple splits fails

2016-04-03 Thread Ted Yu
l v = lines.filter(_.contains("ASE 15")).filter(_ >> contains("UPDATE INDEX STATISTICS")).flatMap(line => >> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ + >> _).collect.foreach(println) >> >> >> Dr Mich Talebzadeh >

Re: multiple splits fails

2016-04-03 Thread Ted Yu
terialize all rows? > > Cheers > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://tal

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Ted Yu
bq. the modifications do not touch the scheduler If the changes can be ported over to 1.6.1, do you mind reproducing the issue there ? I ask because master branch changes very fast. It would be good to narrow the scope where the behavior you observed started showing. On Mon, Apr 4, 2016 at 6:12

Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-04 Thread Ted Yu
bq. I'm on version 2.10 for spark The above is Scala version. Can you give us the Spark version ? Thanks On Mon, Apr 4, 2016 at 2:36 PM, mpawashe wrote: > Hi all, > > I am using Spark Streaming API (I'm on version 2.10 for spark and > streaming), and I am running into a function serialization

Re: dataframe sorting and find the index of the maximum element

2016-04-05 Thread Ted Yu
Did you define idxmax() method yourself ? Thanks On Tue, Apr 5, 2016 at 4:17 AM, Angel Angel wrote: > Hello, > > i am writing one spark application i which i need the index of the maximum > element. > > My table has one column only and i want the index of the maximum element. > > MAX(count) > 2

Re: dataframe sorting and find the index of the maximum element

2016-04-05 Thread Ted Yu
The error was due to REPL expecting an integer (index to the Array) whereas "MAX(count)" was a String. What do you want to achieve ? On Tue, Apr 5, 2016 at 4:17 AM, Angel Angel wrote: > Hello, > > i am writing one spark application i which i need the index of the maximum > element. > > My table
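If the goal is the row holding the maximum, one hedged approach (assumes a DataFrame df with a column named count):

import org.apache.spark.sql.functions.desc

val top = df.orderBy(desc("count")).limit(1).collect() // row with the largest count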

Re: [Yarn] Spark AMs dead lock

2016-04-06 Thread Ted Yu
Which hadoop release are you using ? bq. yarn cluster with 2GB RAM I assume 2GB is per node. Isn't this too low for your use case ? Cheers On Wed, Apr 6, 2016 at 9:19 AM, Peter Rudenko wrote: > Hi i have a situation, say i have a yarn cluster with 2GB RAM. I'm > submitting 2 spark jobs with "

Re: how to query the number of running executors?

2016-04-06 Thread Ted Yu
Have you looked at SparkListener ? /** * Called when the driver registers a new executor. */ def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit /** * Called when the driver removes an executor. */ def onExecutorRemoved(executorRemoved: SparkListenerExecutorRe
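A sketch of a listener that keeps a live executor count (registered via sc.addSparkListener, a DeveloperApi):

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

class ExecutorCountListener extends SparkListener {
  val live = new AtomicInteger(0)
  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit = live.incrementAndGet()
  override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit = live.decrementAndGet()
}

val listener = new ExecutorCountListener
sc.addSparkListener(listener) // later: listener.live.get()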

Re: building kafka project on intellij Help is much appreciated

2016-04-07 Thread Ted Yu
This is the version of Kafka Spark depends on: [INFO] +- org.apache.kafka:kafka_2.10:jar:0.8.2.1:compile On Thu, Apr 7, 2016 at 9:14 AM, Haroon Rasheed wrote: > Try removing libraryDependencies += "org.apache.kafka" %% "kafka" % "1.6.0" > compile. I guess the internal dependencies are automatic

Re: Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-07 Thread Ted Yu
Which Spark release are you using ? Have you registered to all the events provided by SparkListener ? If so, can you do event-wise summation of execution time ? Thanks On Thu, Apr 7, 2016 at 11:03 AM, JasmineGeorge wrote: > We are running a batch job with the following specifications > •

Re: can not join dataset with itself

2016-04-08 Thread Ted Yu
Looks like you're using Spark 1.6.x What error(s) did you get for the first two joins ? Thanks On Fri, Apr 8, 2016 at 3:53 AM, JH P wrote: > Hi. I want a dataset join with itself. So i tried below codes. > > 1. newGnsDS.joinWith(newGnsDS, $"dataType”) > > 2. newGnsDS.as("a").joinWith(newGnsDS.
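For reference, a hedged sketch of a Dataset self-join with aliases (dataType comes from the question; using it as the join key is an assumption):

val joined = newGnsDS.as("a")
  .joinWith(newGnsDS.as("b"), $"a.dataType" === $"b.dataType")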

Re: How to configure parquet.block.size on Spark 1.6

2016-04-08 Thread Ted Yu
I searched 1.6.1 code base but didn't find how this can be configured (within Spark). On Fri, Apr 8, 2016 at 9:01 AM, nihed mbarek wrote: > Hi > How to configure parquet.block.size on Spark 1.6 ? > > Thank you > Nihed MBAREK > > > -- > > M'BAREK Med Nihed, > Fedora Ambassador, TUNISIA, Northern

Re: DataFrame job fails on parsing error, help?

2016-04-08 Thread Ted Yu
Did you encounter similar error on a smaller dataset ? Which release of Spark are you using ? Is it possible you have an incompatible snappy version somewhere in your classpath ? Thanks On Fri, Apr 8, 2016 at 12:36 PM, entee wrote: > I'm trying to do a relatively large join (0.5TB shuffle rea

Re: DataFrame job fails on parsing error, help?

2016-04-08 Thread Ted Yu
gt; pyspark.sql on a Spark DataFrame. > > Any ideas? > > Nicolas > > On Fri, Apr 8, 2016 at 1:13 PM, Ted Yu wrote: > >> Did you encounter similar error on a smaller dataset ? >> >> Which release of Spark are you using ? >> >> Is it possible

Re: Unable run Spark in YARN mode

2016-04-09 Thread Ted Yu
mahesh : bq. :16: error: not found: value sqlContext Please take a look at: https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext for how the import should be used. Please include version of Spark and the commandline you used in the reply.

Re: Weird error while serialization

2016-04-09 Thread Ted Yu
The value was out of the range of integer. Which Spark release are you using ? Can you post snippet of code which can reproduce the error ? Thanks On Sat, Apr 9, 2016 at 12:25 PM, SURAJ SHETH wrote: > I am trying to perform some processing and cache and count the RDD. > Any solutions? > > See

Re: Graphframes pattern causing java heap space errors

2016-04-10 Thread Ted Yu
Looks like the exception occurred on driver. Consider increasing the values for the following config: conf.set("spark.driver.memory", "10240m") conf.set("spark.driver.maxResultSize", "2g") Cheers On Sat, Apr 9, 2016 at 9:02 PM, Buntu Dev wrote: > I'm running it via pyspark against yarn in cli

Re: Datasets combineByKey

2016-04-10 Thread Ted Yu
Haven't found any JIRA w.r.t. combineByKey for Dataset. What's your use case ? Thanks On Sat, Apr 9, 2016 at 7:38 PM, Amit Sela wrote: > Is there (planned ?) a combineByKey support for Dataset ? > Is / Will there be a support for combiner lifting ? > > Thanks, > Amit >

Re: Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-10 Thread Ted Yu
nd Events. > > I can do the event wise summation for couple of runs and get back to you. > > > > Thanks, > > Jasmine > > > > *From:* Ted Yu [mailto:yuzhih...@gmail.com] > *Sent:* Thursday, April 07, 2016 1:43 PM > *To:* JasmineGeorge > *Cc:* user > *

Re: Weird error while serialization

2016-04-10 Thread Ted Yu
mbda x : x.rsplit('\t',1)).map(lambda x : > [x[0],getRows(x[1])]).cache()\ > .groupBy(lambda x : x[0].split('\t')[1]).mapValues(lambda x : > list(x)).cache() > > text1.count() > > Thanks and Regards, > Suraj Sheth > > On Sun, Apr 10, 2016 at

Re: Hello !

2016-04-11 Thread Ted Yu
For SparkR, please refer to https://spark.apache.org/docs/latest/sparkr.html bq. on Ubuntu or CentOS Both platforms are supported. On Mon, Apr 11, 2016 at 1:08 PM, wrote: > Dear Experts , > > I am posting this for your information. I am a newbie to spark. > I am interested in understanding Spa

Re: Read JSON in Dataframe and query

2016-04-11 Thread Ted Yu
Please take a look at sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala Cheers On Mon, Apr 11, 2016 at 12:13 PM, Radhakrishnan Iyer < radhakrishnan.i...@citiustech.com> wrote: > Hi all, > > > > I am new to Spark. > > I have a json in below format : > > Empl
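A minimal 1.x sketch (read.json expects one JSON object per line; the path and field names are illustrative):

val df = sqlContext.read.json("employees.json")
df.printSchema()
df.registerTempTable("employees")
sqlContext.sql("SELECT name FROM employees WHERE age > 30").show()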

Re: Is storage resources counted during the scheduling

2016-04-11 Thread Ted Yu
See https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application On Mon, Apr 11, 2016 at 3:15 PM, Jialin Liu wrote: > Hi Spark users/experts, > > I’m wondering how does the Spark scheduler work? > What kind of resources will be considered during the scheduling, does

Re: build/sbt gen-idea error

2016-04-12 Thread Ted Yu
gen-idea doesn't seem to be a valid command: [warn] Ignoring load failure: no project loaded. [error] Not a valid command: gen-idea [error] gen-idea On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K <875061...@qq.com> wrote: > Hi, > I have cloned spark and , > cd spark > build/sbt gen-idea > > got the fol

Re: [spark] build/sbt gen-idea error

2016-04-12 Thread Ted Yu
See https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup On Tue, Apr 12, 2016 at 8:52 AM, ImMr.K <875061...@qq.com> wrote: > But how to import spark repo into idea or eclipse? > > > > -- Original Message ---------

Re: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe

2016-04-12 Thread Ted Yu
bq. Most recent failure cause: Can you paste the remaining cause ? Which Spark release are you using ? Thanks On Tue, Apr 12, 2016 at 1:10 PM, AlexModestov wrote: > I get an error while I form a dataframe from the parquet file: > > Py4JJavaError: An error occurred while calling > z:org.apache

Re: JavaRDD with custom class?

2016-04-12 Thread Ted Yu
You can find various examples involving Serializable Java POJO e.g. ./examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java Please pastebin some details on 'Task not serializable error' Thanks On Tue, Apr 12, 2016 at 12:44 PM, Daniel Valdivia wrote: > Hi, > > I'm moving some

Re: Old hostname pops up while running Spark app

2016-04-12 Thread Ted Yu
FYI https://documentation.cpanel.net/display/CKB/How+To+Clear+Your+DNS+Cache#HowToClearYourDNSCache-MacOSX10.10 https://www.whatsmydns.net/flush-dns.html#linux On Tue, Apr 12, 2016 at 2:44 PM, Bibudh Lahiri wrote: > Hi, > > I am trying to run a piece of code with logistic regression on > P

Re: Logging in executors

2016-04-13 Thread Ted Yu
bq. --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties" I think the above may have a typo : you refer to log4j-driver.properties in both arguments. FYI On Wed, Apr 13, 2016 at 8:09 AM, Carlos Rojas Matas wrote: > Hi guys, > > I'm trying to enable log
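A hedged sketch of the usual fix: ship a separate executor config with --files and reference it by bare file name, since it lands in each executor's working directory (log4j-executor.properties is an assumed name):

spark-submit \
  --files env/dev/log4j-executor.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:env/dev/log4j-driver.properties" \
  ...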

Re: Streaming WriteAheadLogBasedBlockHandler disallows parellism via StorageLevel replication factor

2016-04-13 Thread Ted Yu
w.r.t. the effective storage level log, here is the JIRA which introduced it: [SPARK-4671][Streaming]Do not replicate streaming block when WAL is enabled On Wed, Apr 13, 2016 at 7:43 AM, Patrick McGloin wrote: > Hi all, > > If I am using a Custom Receiver with Storage Level set to StorageLevel.

Re: Spark Yarn closing sparkContext

2016-04-14 Thread Ted Yu
Can you pastebin the failure message ? Did you happen to take jstack during the close ? Which Hadoop version do you use ? Thanks > On Apr 14, 2016, at 5:53 AM, nihed mbarek wrote: > > Hi, > I have an issue with closing my application context, the process take a long > time with a fail at t

Re: Error with --files

2016-04-14 Thread Ted Yu
bq. localtest.txt#appSees.txt Which file did you want to pass ? Thanks On Thu, Apr 14, 2016 at 2:14 PM, Benjamin Zaitlen wrote: > Hi All, > > I'm trying to use the --files option with yarn: > > spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files >> /home/ubuntu/localtest.txt#

Re: When did Spark started supporting ORC and Parquet?

2016-04-14 Thread Ted Yu
For Parquet, please take a look at SPARK-1251 For ORC, not sure. Looking at git history, I found ORC mentioned by SPARK-1368 FYI On Thu, Apr 14, 2016 at 6:53 PM, Edmon Begoli wrote: > I am needing this fact for the research paper I am writing right now. > > When did Spark start supporting Parq

Re: How to stop hivecontext

2016-04-15 Thread Ted Yu
You can call stop() method. > On Apr 15, 2016, at 5:21 AM, ram kumar wrote: > > Hi, > I started hivecontext as, > > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc); > > I want to stop this sql context > > Thanks

Re: Logging in executors

2016-04-15 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtsFrd61q291j1 On Fri, Apr 15, 2016 at 5:38 AM, Carlos Rojas Matas wrote: > Hi guys, > > any clue on this? Clearly the > spark.executor.extraJavaOpts=-Dlog4j.configuration is not working on the > executors. > > Thanks, > -carlos. > > On Wed, Apr 13,

Re: ERROR [main] client.ConnectionManager$HConnectionImplementation: The node /hbase is not in ZooKeeper.

2016-04-16 Thread Ted Yu
Please send query to user@hbase This is the default value: zookeeper.znode.parent = /hbase Looks like hbase-site.xml accessible on your client didn't have up-to-date value for zookeeper.znode.parent Please make sure hbase-site.xml with proper config is on the classpath. On Sat, Apr 16, 20

Re: Apache Flink

2016-04-16 Thread Ted Yu
Looks like this question is more relevant on flink mailing list :-) On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh wrote: > Hi, > > Has anyone used Apache Flink instead of Spark by any chance > > I am interested in its set of libraries for Complex Event Processing. > > Frankly I don't know if

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-16 Thread Ted Yu
Kevin: Can you describe how you got over the Metadata fetch exception ? > On Apr 16, 2016, at 9:41 AM, Kevin Eid wrote: > > One last email to announce that I've fixed all of the issues. Don't hesitate > to contact me if you encounter the same. I'd be happy to help. > > Regards, > Kevin > >> O

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
From the output you posted: --- Unpacking Spark gzip: stdin: not in gzip format tar: Child returned status 1 tar: Error is not recoverable: exiting now --- The artifact for spark-1.6.1-bin-hadoop2.6 is corrupt. This problem has been reported in other threads. Try spark-1.6.1-bin-hadoop

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
Apr 16, 2016 at 2:14 PM, Ted Yu wrote: > From the output you posted: > --- > Unpacking Spark > > gzip: stdin: not in gzip format > tar: Child returned status 1 > tar: Error is not recoverable: exiting now > --- > > The artifact for spark-1.6.1-bin-hadoop2.6 i

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
to > https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz > and I get a NoSuchKey error. > > Should I just go with it even though it says hadoop2.6? > > On Sat, Apr 16, 2016 at 5:37 PM, Ted Yu wrote: > >> BTW this was the original thread: &g

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
bucket, so hopefully everything should be > working now. Let me know if you still encounter any problems with > unarchiving. > > On Sat, Apr 16, 2016 at 3:10 PM Ted Yu wrote: > >> Pardon me - there is no tarball for hadoop 2.7 >> >> I downloaded >> https://s

Re: Logging in executors

2016-04-18 Thread Ted Yu
tely not working, at least for logging configuration. > > Thanks, > -carlos. > > On Fri, Apr 15, 2016 at 3:28 PM, Ted Yu wrote: > >> See this thread: http://search-hadoop.com/m/q3RTtsFrd61q291j1 >> >> On Fri, Apr 15, 2016 at 5:38 AM, Carlos Ro

Re: hbaseAdmin tableExists create catalogTracker for every call

2016-04-19 Thread Ted Yu
The CatalogTracker object may not be used by all the methods of HBaseAdmin. Meaning, when HBaseAdmin is constructed, we don't need CatalogTracker. On Tue, Apr 19, 2016 at 6:09 AM, WangYQ wrote: > in hbase 0.98.10, class "HBaseAdmin " > line 303, method "tableExists", will create a catal

Re: Spark streaming batch time displayed is not current system time but it is processing current messages

2016-04-19 Thread Ted Yu
Using http://www.ruddwire.com/handy-code/date-to-millisecond-calculators/#.VxZh3iMrKuo , 1460823008000 is shown to be 'Sat Apr 16 2016 09:10:08 GMT-0700' Can you clarify the 4 day difference ? bq. 'right now April 14th' The date of your email was Apr 16th. On Sat, Apr 16, 2016 at 9:39 AM, Hemal

Re: Why very small work load cause GC overhead limit?

2016-04-19 Thread Ted Yu
Can you tell us the memory parameters you used ? If you can capture jmap before the GC limit was exceeded, that would give us more clue. Thanks > On Apr 19, 2016, at 7:40 PM, "kramer2...@126.com" wrote: > > Hi All > > I use spark doing some calculation. > The situation is > 1. New file wi

Re: 回复:Spark sql and hive into different result with same sql

2016-04-20 Thread Ted Yu
Do you mind trying out build from master branch ? 1.5.3 is a bit old. On Wed, Apr 20, 2016 at 5:25 AM, FangFang Chen wrote: > I found spark sql lost precision, and handle data as int with some rule. > Following is data got via hive shell and spark sql, with same sql to same > hive table: > Hive

Re: Invoking SparkR from Spark shell

2016-04-20 Thread Ted Yu
Please take a look at: https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes On Wed, Apr 20, 2016 at 9:50 AM, Ashok Kumar wrote: > Hi, > > I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R > with Spark. > > Is there a shell similar to spark-shell that support

Re: Unable to improve ListStatus performance of ParquetRelation

2016-04-20 Thread Ted Yu
FileStatusCache used to be inside interfaces.scala But in master branch, I no longer see it there. Looks like refactor has removed the class. On Wed, Apr 20, 2016 at 11:19 AM, Ditesh Kumar wrote: > Hi, > > When creating a DataFrame from a partitioned file structure ( > sqlContext.read.parquet("

Re: StructField Translation Error with Spark SQL

2016-04-20 Thread Ted Yu
The weight field is not nullable. Looks like your source table had null value for this field. On Wed, Apr 20, 2016 at 4:11 PM, Charles Nnamdi Akalugwu < cprenzb...@gmail.com> wrote: > Hi, > > I am using spark 1.4.1 and trying to copy all rows from a table in one > MySQL Database to a Amazon RDS

Re: StructField Translation Error with Spark SQL

2016-04-21 Thread Ted Yu
> > Can't translate null value for field > StructField(density,DecimalType(4,2),true) > On Apr 21, 2016 1:37 AM, "Ted Yu" wrote: > >> The weight field is not nullable. >> >> Looks like your source table had null value for this field. >> >>

Re: RDD generated from Dataframes

2016-04-21 Thread Ted Yu
In upcoming 2.0 release, the signature for map() has become: def map[U : Encoder](func: T => U): Dataset[U] = withTypedPlan { Note: DataFrame and DataSet are unified in 2.0 FYI On Thu, Apr 21, 2016 at 6:49 AM, Apurva Nandan wrote: > Hello everyone, > > Generally speaking, I guess it's well
