Is spark-ec2 going away?

2016-01-27 Thread Sung Hwan Chung
I noticed that in the main branch, the ec2 directory along with the spark-ec2 script is no longer present. Is spark-ec2 going away in the next release? If so, what would be the best alternative at that time? A couple of additional questions: 1. Is there any way to add/remove additional workers

Re: How to debug ClassCastException: java.lang.String cannot be cast to java.lang.Long in SparkSQL

2016-01-27 Thread Jakob Odersky
> the data type mapping has been taken care of in my code, could you share this? On Tue, Jan 26, 2016 at 8:30 PM, Anfernee Xu wrote: > Hi, > > I'm using Spark 1.5.0, I wrote a custom Hadoop InputFormat to load data from > 3rdparty datasource, the data type mapping has been

RE: Spark SQL joins taking too long

2016-01-27 Thread Cheng, Hao
Another possibility is the parallelism: it is probably 1 or some other small value, since the input data size is not that big. If that is the case, you can try something like: Df1.repartition(10).registerTempTable(“hospitals”); Df2.repartition(10).registerTempTable(“counties”); … And
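
A minimal sketch of this suggestion, assuming two small DataFrames df1 and df2 (the join columns and output names below are illustrative):

    // Raise parallelism before the join by repartitioning both sides
    df1.repartition(10).registerTempTable("hospitals")
    df2.repartition(10).registerTempTable("counties")
    val joined = sqlContext.sql(
      "SELECT h.name, c.name FROM hospitals h JOIN counties c ON h.county = c.name")
    joined.count()  // force execution to observe the timing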

Spark streaming flow control and back pressure

2016-01-27 Thread Lin Zhao
I have an actor receiver that reads data and calls "store()" to save data to spark. I was hoping spark.streaming.receiver.maxRate and spark.streaming.backpressure would help me block the method when needed to avoid overflowing the pipeline. But it doesn't. My actor pumps millions of lines to
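
For reference, a hedged sketch of the two settings mentioned; note that they throttle how fast received blocks are turned into batches rather than blocking a custom receiver's store() calls, which matches the behaviour described above:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")  // dynamic rate control, Spark 1.5+
      .set("spark.streaming.receiver.maxRate", "10000")     // upper bound in records/sec per receiver (value illustrative)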

Re: Escaping tabs and newlines not working

2016-01-27 Thread Jakob Odersky
Can you provide some code the reproduces the issue, specifically in a spark job? The linked stackoverflow question is related to plain scala and the proposed answers offer a solution. On Wed, Jan 27, 2016 at 1:57 PM, Harshvardhan Chauhan wrote: > > > Hi, > > Escaping newline

Re: Is spark-ec2 going away?

2016-01-27 Thread Nicholas Chammas
I noticed that in the main branch, the ec2 directory along with the spark-ec2 script is no longer present. It’s been moved out of the main repo to its own location: https://github.com/amplab/spark-ec2/pull/21 Is spark-ec2 going away in the next release? If so, what would be the best alternative

Re: Is spark-ec2 going away?

2016-01-27 Thread Alexander Pivovarov
You can use EMR 4.3.0 and run on spot instances to control the price. Yes, you can add/remove instances to the cluster on the fly (CORE instances support add only; TASK instances support add and remove). On Wed, Jan 27, 2016 at 2:07 PM, Sung Hwan Chung wrote: > I noticed that in

RE: JSON to SQL

2016-01-27 Thread Cheng, Hao
Have you ever tried the DataFrame API, e.g. sqlContext.read.json("/path/to/file.json")? Spark SQL will auto-infer the type/schema for you. And LATERAL VIEW will help with the flattening issues, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView, as well as the “a.b[0].c”
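
A hedged sketch of the suggestion; LATERAL VIEW is HiveQL, so a HiveContext is assumed, and the field names "a" and "b" are placeholders:

    val df = hiveContext.read.json("/path/to/file.json")  // Spark SQL infers the schema
    df.registerTempTable("events")
    // Flatten an array column into one row per element
    hiveContext.sql("SELECT a, item FROM events LATERAL VIEW explode(b) t AS item").show()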

Re: Is spark-ec2 going away?

2016-01-27 Thread Sung Hwan Chung
Hm thanks, I think what you are suggesting sounds like a recommendation for AWS EMR. However, my questions were wrt spark-ec2. For our uses involving spot-instances, EMR could potentially double/triple prices due to the additional premiums. Thanks anyway! On Wed, Jan 27, 2016 at 2:12 PM,

Re: Spark streaming flow control and back pressure

2016-01-27 Thread Lin Zhao
One solution is to read the scheduling delay and my actor can go to sleep if needed. Is this possible? From: Lin Zhao > Date: Wednesday, January 27, 2016 at 5:28 PM To: "user@spark.apache.org"

Re: Spark, Mesos, Docker and S3

2016-01-27 Thread Sathish Kumaran Vairavelu
Hi, On the same Spark/Mesos/Docker setup, I am getting warning "Initial Job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources". I am running in coarse grained mode. Any pointers on how to fix this issue? Please help. I have

corresponding sql for query against LocalRelation

2016-01-27 Thread ey-chih chow
Hi, For a query against a LocalRelation, does anybody know what the corresponding SQL looks like? Thanks. Best regards, Ey-Chih Chow -- View this message in context:

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
mapPartitions(...) seems like a good candidate, since it processes a whole partition while maintaining state across map(...) calls. On Wed, Jan 27, 2016 at 6:58 PM, Ted Yu wrote: > Initially I thought of using accumulators. > > Since state change can be anything, how

Re: Is spark-ec2 going away?

2016-01-27 Thread Nick Pentreath
If I recall correctly, there is no additional premium for using EMR unless you use one of the MapR distributions they offer, or the other value adds. So a vanilla EMR cluster with spot instances will be no different cost than using spark-ec2. Sent from my iPhone > On 28 Jan 2016, at 01:34,

Re: NA value handling in sparkR

2016-01-27 Thread Hyukjin Kwon
Hm.. As far as I remember, you can set the value to treat as null with the nullValue option. I am hitting network issues with GitHub so I can't check this right now, but please try that option as described in https://github.com/databricks/spark-csv. 2016-01-28 0:55 GMT+09:00 Felix Cheung
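
A hedged sketch of that option with the spark-csv package (option name as documented in its README; the path is illustrative):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("nullValue", "NA")  // treat the literal string "NA" as null
      .load("/path/to/file.csv")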

Re: Maintain state outside rdd

2016-01-27 Thread Ted Yu
Have you looked at this method ? * Zips this RDD with its element indices. The ordering is first based on the partition index ... def zipWithIndex(): RDD[(T, Long)] = withScope { On Wed, Jan 27, 2016 at 6:03 PM, Krishna wrote: > Hi, > > I've a scenario where I need
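
A minimal Scala sketch of that approach (inpRdd stands in for the input RDD from the original question):

    // Assign a stable element index without any mutable state
    val indexed = inpRdd.zipWithIndex()   // RDD[(T, Long)]; the index is global across partitions
    val rowNumbers = indexed.map(_._2)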

Re: Maintain state outside rdd

2016-01-27 Thread Jakob Odersky
Be careful with mapPartitions though, since it is executed on worker nodes, you may not see side-effects locally. Is it not possible to represent your state changes as part of your rdd's transformations? I.e. return a tuple containing the modified data and some accumulated state. If that really
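
A Scala sketch of the suggestion to carry the state along with the data instead of mutating anything outside the RDD (inpRdd is a placeholder):

    // Per-partition counter kept inside the closure; the state travels with each record
    val withState = inpRdd.mapPartitionsWithIndex { (partitionId, iter) =>
      var count = 0L
      iter.map { record =>
        count += 1
        (partitionId, count, record)  // tuple of (partition, running state, original data)
      }
    }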

Maintain state outside rdd

2016-01-27 Thread Krishna
Hi, I've a scenario where I need to maintain state that is local to a worker and can change during a map operation. What's the best way to handle this?

    incr = 0
    def row_index():
        global incr
        incr += 1
        return incr
    out_rdd = inp_rdd.map(lambda x: row_index()).collect()

"out_rdd"

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
Executing on the workers is fine, since the state I would like to maintain is specific to a partition. Accumulators, being global counters, won't work. On Wednesday, January 27, 2016, Jakob Odersky wrote: > Be careful with mapPartitions though, since it is executed on worker > nodes,

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
Thanks; What I'm looking for is a way to see changes to the state of some variable during map(..) phase. I simplified the scenario in my example by making row_index() increment "incr" by 1 but in reality, the change to "incr" can be anything. On Wed, Jan 27, 2016 at 6:25 PM, Ted Yu

Re: corresponding sql for query against LocalRelation

2016-01-27 Thread Jeff Zhang
I think LocalRelation is used for DataFrames:

    val df = sqlContext.createDataFrame(Seq((1,"jeff"),(2, "andy")))
    df.explain(true)

    == Parsed Logical Plan ==
    LocalRelation [_1#0,_2#1], [[1,jeff],[2,andy]]
    == Analyzed Logical Plan ==
    _1: int, _2: string
    LocalRelation [_1#0,_2#1],

Re: Spark SQL joins taking too long

2016-01-27 Thread Raghu Ganti
Why would changing the order of the join make such a big difference? I will try the repartition, although, it does not make sense to me why repartitioning should help, since the data itself is so small! Regards, Raghu > On Jan 27, 2016, at 20:08, Cheng, Hao wrote: > >

Re: Maintain state outside rdd

2016-01-27 Thread Ted Yu
Initially I thought of using accumulators. Since state change can be anything, how about storing state in external NoSQL store such as hbase ? On Wed, Jan 27, 2016 at 6:37 PM, Krishna wrote: > Thanks; What I'm looking for is a way to see changes to the state of some >

Re: JSON to SQL

2016-01-27 Thread Andrés Ivaldi
I'm using DataFrames and reading the JSON exactly as you say, and I can get the schema from there. Reading the documentation, I realized it is possible to create a structure dynamically, so by applying some transformations to the DataFrame plus the new structure I'll be able to save the JSON on my

Re: How data locality is honored when spark is running on yarn

2016-01-27 Thread Saisai Shao
Hi Todd, There are two levels of locality-based scheduling when you run Spark on YARN with dynamic allocation enabled: 1. Container allocation is based on the locality ratio of pending tasks; this is YARN-specific and only works with dynamic allocation enabled. 2. Task scheduling is locality
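
A hedged sketch of the settings involved (values are illustrative; the external shuffle service must also be running on each NodeManager for dynamic allocation):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")  // lets YARN place new executors near pending tasks' data
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.locality.wait", "3s")                // how long task scheduling waits for a node-local slot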

Re: Python UDFs

2016-01-27 Thread Jakob Odersky
Have you checked: - the mllib doc for python https://spark.apache.org/docs/1.6.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseVector - the udf doc https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.functions.udf You should be fine in returning a DenseVector

Re: Is spark-ec2 going away?

2016-01-27 Thread Sung Hwan Chung
Thanks! That's very helpful. On Wed, Jan 27, 2016 at 3:33 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > I noticed that in the main branch, the ec2 directory along with the > spark-ec2 script is no longer present. > > It’s been moved out of the main repo to its own location: >

Neo4j and Spark/GraphX

2016-01-27 Thread Sahil Sareen
Hey everyone! I'm using spark and graphx for graph processing and wish to export a subgraph to Neo4j(from the spark-submit console) for visualisation and basic graph querying that neo4j supports. I looked at the mazerunner project but it seems to be overkill. Any alternatives? -Sahil

how to run latest version of spark in old version of spark in cloudera cluster ?

2016-01-27 Thread kali.tumm...@gmail.com
Hi All, I just realized the Cloudera version of Spark on my cluster is 1.2, while the jar I built using Maven is against version 1.6, which is causing an issue. Is there a way to run a Spark 1.6 application on a cluster with Spark 1.2 installed? Thanks Sri -- View this message in context:

Re: how to run latest version of spark in old version of spark in cloudera cluster ?

2016-01-27 Thread honaink
Hi Sri, Each node on the cluster where Spark can run will have Spark 1.2 on it. If you can, update the cluster to Spark 1.6; otherwise, you can't run 1.6 on those nodes. -honain kali.tumm...@gmail.com wrote > Hi All, > > Just realized cloudera version of spark on my

Re: JSON to SQL

2016-01-27 Thread Al Pivonka
Are you using a relational database? If so, why not use a NoSQL DB first, then pull from it into your relational store? Or utilize a library that understands JSON structure, like Jackson, to obtain the data from the JSON structure and then persist the domain objects? On Wed, Jan 27, 2016 at 9:45 AM, Andrés Ivaldi

Re: Having issue with Spark SQL JDBC on hive table !!!

2016-01-27 Thread @Sanjiv Singh
Hi Ted, It's a typo. Regards Sanjiv Singh Mob : +091 9990-447-339 On Wed, Jan 27, 2016 at 9:13 PM, Ted Yu wrote: > In the last snippet, temptable is shown by 'show tables' command. > Yet you queried tampTable. > > I believe this just was typo :-) > > On Wed, Jan 27, 2016

Re: Spark, Mesos, Docker and S3

2016-01-27 Thread Sathish Kumaran Vairavelu
Thanks a lot for your info! I will try this today. On Wed, Jan 27, 2016 at 9:29 AM Mao Geng wrote: > Hi Sathish, > > The docker image is normal, no AWS profile included. > > When the driver container runs with --net=host, the driver host's AWS > profile will take effect so

Re: Spark 2.0.0 release plan

2016-01-27 Thread Daniel Siegmann
Will there continue to be monthly releases on the 1.6.x branch during the additional time for bug fixes and such? On Tue, Jan 26, 2016 at 11:28 PM, Koert Kuipers wrote: > thanks thats all i needed > > On Tue, Jan 26, 2016 at 6:19 PM, Sean Owen wrote: > >>

Saving a pipeline model ?

2016-01-27 Thread Vinayak Agrawal
Hi, I am working with the Spark ML package, which creates a pipeline model. I am looking for a way to save this model so that I can use it. My code is in PySpark. I found this JIRA which says that this feature is currently in progress: https://issues.apache.org/jira/browse/SPARK-6725 My question is, how

Re: Having issue with Spark SQL JDBC on hive table !!!

2016-01-27 Thread Ted Yu
In the last snippet, temptable is shown by 'show tables' command. Yet you queried tampTable. I believe this just was typo :-) On Wed, Jan 27, 2016 at 7:07 AM, @Sanjiv Singh wrote: > Hi All, > > I have configured Spark to query on hive table. > > Run the Thrift JDBC/ODBC

Storing JavaDStream into a hive table

2016-01-27 Thread samrat
Hello, I have the below code: JavaReceiverInputDStream messages = FlumeUtils.createStream(sc, host, port); Is there a way to store the above created stream (i.e. messages) into a Hive table? Basically, I want to store the Spark Streaming data into a Hive table. Thank you, samrat -- View
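
One possible approach, sketched in the Scala API and heavily hedged: a HiveContext is assumed, ssc stands for the StreamingContext, and both the event-to-column mapping and the table name are placeholders.

    // Create once outside the streaming loop and reuse across batches
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(ssc.sparkContext)
    import hiveContext.implicits._

    messages.foreachRDD { rdd =>
      // Mapping a Flume event to columns is application-specific; a single string column is assumed here
      val df = rdd.map(event => event.toString).toDF("line")
      df.write.mode("append").saveAsTable("stream_events")  // illustrative Hive table name
    }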

Re: Spark, Mesos, Docker and S3

2016-01-27 Thread Mao Geng
Hi Sathish, The docker image is normal, no AWS profile included. When the driver container runs with --net=host, the driver host's AWS profile will take effect so that the driver can access the protected s3 files. Similarly, Mesos slaves also run Spark executor docker container in

Re: NA value handling in sparkR

2016-01-27 Thread Felix Cheung
That's correct - and that's because spark-csv, as a Spark package, is not specifically aware of R's notion of NA and interprets it as a string value. On the other hand, R's native NA is converted to NULL in Spark when creating a Spark DataFrame from an R data.frame.

Compile error when compiling spark 2.0.0 snapshot code base in IDEA

2016-01-27 Thread Todd
Hi, I am able to mvn install the whole Spark project (from GitHub) in my IDEA. But when I run the SparkPi example, IDEA compiles the code again and the following exception is thrown. Has anyone met this problem? Thanks a lot. Error:scalac: while compiling:

Re: spark streaming web ui not showing the events - direct kafka api

2016-01-27 Thread Cody Koeninger
Have you tried spark 1.5? On Wed, Jan 27, 2016 at 11:14 AM, vimal dinakaran wrote: > Hi , > I am using spark 1.4 with direct kafka api . In my streaming ui , I am > able to see the events listed in UI only if add stream.print() statements > or else event rate and input

Hive on Spark knobs

2016-01-27 Thread Ruslan Dautkhanov
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started There are quite a lot of knobs to tune for Hive on Spark. The above page recommends the following settings:

    mapreduce.input.fileinputformat.split.maxsize=75000
    hive.vectorized.execution.enabled=true

Using Spark in mixed Java/Scala project

2016-01-27 Thread jeremycod
Hi, I have a mixed Java/Scala project. I have already been using Spark in Scala code in local mode. Now, some new team members should develop functionalities that should use Spark but in Java code, and they are not familiar with Scala. I know it's not possible to have two Spark contexts in the

spark streaming web ui not showing the events - direct kafka api

2016-01-27 Thread vimal dinakaran
Hi, I am using Spark 1.4 with the direct Kafka API. In my streaming UI, I am able to see the events listed in the UI only if I add stream.print() statements; otherwise the event rate and input events remain at 0 even though the events get processed. Without print statements, I have the action

Re: JSON to SQL

2016-01-27 Thread Andrés Ivaldi
We don't have domain objects; it's a service like a pipeline: data is read from the source and saved into a relational database. I can read the structure from DataFrames and do some transformations; I would prefer to do it with Spark to be consistent with the process. On Wed, Jan 27, 2016 at

Re: Spark SQL . How to enlarge output rows ?

2016-01-27 Thread bo yang
Hi Eli, are you using Python? I see there is a method show(numRows) in Java, but not sure about Python. On Wed, Jan 27, 2016 at 2:39 AM, Akhil Das wrote: > Why would you want to print all rows? You can try the following: > > sqlContext.sql("select day_time from

Python UDFs

2016-01-27 Thread Stefan Panayotov
Hi, I have defined a UDF in Scala like this:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
    import org.apache.spark.mllib.linalg.DenseVector
    val determineVector = udf((a: Double, b: Double) => { val data:
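
A hedged completion of that pattern (the UDF body below is illustrative, not the poster's actual logic; df and its column names are placeholders):

    import org.apache.spark.mllib.linalg.DenseVector
    import org.apache.spark.sql.functions.udf

    val determineVector = udf((a: Double, b: Double) => new DenseVector(Array(a, b, a + b)))
    val withVector = df.withColumn("vec", determineVector(df("colA"), df("colB")))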

Re: JSON to SQL

2016-01-27 Thread Sahil Sareen
Isn't this just about defining a case class and using parse(json).extract[CaseClassName] using Jackson? -Sahil On Wed, Jan 27, 2016 at 11:08 PM, Andrés Ivaldi wrote: > We dont have Domain Objects, its a service like a pipeline, data is read > from source and they are saved

Re: Spark SQL joins taking too long

2016-01-27 Thread Raghu Ganti
The problem is with the way Spark query plan is being created, IMO, what was happening before is that the order of the tables mattered and when the larger table is given first, it took a very long time (~53mins to complete). I changed the order of the tables with the smaller one first (including

Re: Using Spark in mixed Java/Scala project

2016-01-27 Thread Jakob Odersky
JavaSparkContext has a wrapper constructor for the "scala" SparkContext. In this case all you need to do is declare a SparkContext that is accessible both from the Java and Scala sides of your project and wrap the context with a JavaSparkContext. Search for Java source compatibility with Scala for
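
A small sketch of that arrangement:

    // Scala side: create the single SparkContext for the application
    val conf = new org.apache.spark.SparkConf().setAppName("mixed-java-scala")
    val sc = new org.apache.spark.SparkContext(conf)

    // Wherever the Java code needs a context, wrap the same instance
    val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)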

Re: how to run latest version of spark in old version of spark in cloudera cluster ?

2016-01-27 Thread Koert Kuipers
If you have yarn you can just launch your spark 1.6 job from a single machine with spark 1.6 available on it and ignore the version of spark (1.2) that is installed On Jan 27, 2016 11:29, "kali.tumm...@gmail.com" wrote: > Hi All, > > Just realized cloudera version of

Re: spark.kryo.classesToRegister

2016-01-27 Thread Shixiong(Ryan) Zhu
It depends. The default Kryo serializer cannot handle all cases. If you encounter any issue, you can follow the Kryo doc to set up custom serializer: https://github.com/EsotericSoftware/kryo/blob/master/README.md On Wed, Jan 27, 2016 at 3:13 AM, amit tewari wrote: > This
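
A minimal sketch combining the two settings (MyRecord stands in for whatever classes your job serializes):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registration is optional but avoids Kryo writing the full class name with every object
      .registerKryoClasses(Array(classOf[MyRecord]))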

Re: Spark SQL . How to enlarge output rows ?

2016-01-27 Thread Kevin Mellott
I believe that *show* should work if you provide it with both the number of rows and the truncate flag. ex: df.show(10, false) http://spark.apache.org/docs/1.5.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.show On Wed, Jan 27, 2016 at 2:39 AM, Akhil Das

Re: Spark 2.0.0 release plan

2016-01-27 Thread Michael Armbrust
We do maintenance releases on demand when there is enough to justify doing one. I'm hoping to cut 1.6.1 soon, but have not had time yet. On Wed, Jan 27, 2016 at 8:12 AM, Daniel Siegmann < daniel.siegm...@teamaol.com> wrote: > Will there continue to be monthly releases on the 1.6.x branch during

Re: how to run latest version of spark in old version of spark in cloudera cluster ?

2016-01-27 Thread sri hari kali charan Tummala
Hi Koert, I am submitting my code (Spark jar) using spark-submit on a proxy node. I checked the version of the cluster and node and it says 1.2. I didn't really understand what you mean. Can I ask YARN to use a different version of Spark? Or should I override the SPARK_HOME variable to look at 1.6

Re: how to run latest version of spark in old version of spark in cloudera cluster ?

2016-01-27 Thread Koert Kuipers
you need to build spark 1.6 for your hadoop distro, and put that on the proxy node and configure it correctly to find your cluster (hdfs and yarn). then use the spark-submit script for that spark 1.6 version to launch your application on yarn On Wed, Jan 27, 2016 at 3:11 PM, sri hari kali charan
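
A hedged sketch of what that looks like on the proxy/edge node (all paths and the class name are illustrative):

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    /opt/spark-1.6.0-bin-custom/bin/spark-submit \
      --master yarn --deploy-mode cluster \
      --class com.example.MyJob \
      /home/user/my-job-assembly.jar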

Re: JSON to SQL

2016-01-27 Thread Andrés Ivaldi
I'm really brand new to Scala, but if I'm defining a case class, isn't that because I already know the JSON's structure beforehand? If I were able to define a case class dynamically from the JSON structure, then even with Spark I would be able to extract the data On Wed, Jan 27, 2016 at 4:01 PM,

Re: Spark SQL . How to enlarge output rows ?

2016-01-27 Thread sm...@yahoo.com.INVALID
Kevin’s solution works. Just minor correction, Python boolean should be capitalized. That is df.show(10, False) > On Jan 27, 2016, at 12:34 PM, Kevin Mellott wrote: > > I believe that show should work if you provide it with both the number of > rows and the

Re: how to run latest version of spark in old version of spark in cloudera cluster ?

2016-01-27 Thread Deenar Toraskar
Sri Look at the instructions here. They are for 1.5.1, but should also work for 1.6 https://www.linkedin.com/pulse/running-spark-151-cdh-deenar-toraskar-cfa?trk=hp-feed-article-title-publish=true=true Deenar On 27 January 2016 at 20:16, Koert Kuipers wrote: > you need to

Online Learning for MLLib Forest Ensembles

2016-01-27 Thread Scott Imig
Hello, Is there an option for online (or incremental / warm start) learning for the MLLib RandomForest ensembles? Thanks, Imig -- S. Imig | Senior Data Scientist Engineer | richrelevance |m: 425.999.5725 I support Bip 101 and BitcoinXT.

Re: Using Spark in mixed Java/Scala project

2016-01-27 Thread Zoran Jeremic
Hi Jakob, Thanks a lot for your help. I'll try this. Zoran On Wed, Jan 27, 2016 at 10:49 AM, Jakob Odersky wrote: > JavaSparkContext has a wrapper constructor for the "scala" > SparkContext. In this case all you need to do is declare a > SparkContext that is accessible both

Re: hivethriftserver2 problems on upgrade to 1.6.0

2016-01-27 Thread Deenar Toraskar
James, The problem you are facing is due to a feature introduced in Spark 1.6: multi-session mode. If you want to see temporary tables across sessions, set spark.sql.hive.thriftServer.singleSession=true. From Spark 1.6, by default the Thrift server runs in multi-session mode. Which

Re: how to run latest version of spark in old version of spark in cloudera cluster ?

2016-01-27 Thread sri hari kali charan Tummala
Thank you very much, well documented. Thanks Sri On Wed, Jan 27, 2016 at 8:46 PM, Deenar Toraskar wrote: > Sri > > Look at the instructions here. They are for 1.5.1, but should also work > for 1.6 > > >

Escaping tabs and newlines not working

2016-01-27 Thread Harshvardhan Chauhan
Hi, Escaping newline and tab doesn't seem to work for me. Spark version 1.5.2 on EMR, reading files from S3. Here are more details about my issue (Scala escaping newline and tab characters): I am trying to use the following code to get rid of tab and newline characters in the URL, but I still get

How data locality is honored when spark is running on yarn

2016-01-27 Thread Todd
Hi, I am kind of confused about how data locality is honored when Spark is running on YARN (client or cluster mode). Can someone please elaborate on this? Thanks!

RE: ctas fails with "No plan for CreateTableAsSelect"

2016-01-27 Thread Yu, Yucai
As per this document: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS), Hive CTAS has the restriction that the target table cannot be a partitioned table. Tejas also pointed out that you have no need to specify the column information, as it

Re: spark-streaming with checkpointing: error with sparkOnHBase lib

2016-01-27 Thread Akhil Das
Were you able to resolve this? It'd be good if you can paste the code snippet to reproduce this. Thanks Best Regards On Fri, Jan 22, 2016 at 2:06 PM, vinay gupta wrote: > Hi, > I have a spark-streaming application which uses sparkOnHBase lib to do >

[Problem Solved]Re: Spark partition size tuning

2016-01-27 Thread Jia Zou
Hi all, the problem has been solved. I mistakenly used tachyon.user.block.size.bytes instead of tachyon.user.block.size.bytes.default. It works now. Sorry for the confusion and thanks again to Gene! Best Regards, Jia On Wed, Jan 27, 2016 at 4:59 AM, Jia Zou wrote: >

Re: MemoryStore: Not enough space to cache broadcast_N in memory

2016-01-27 Thread Akhil Das
Did you try enabling spark.memory.useLegacyMode and upping spark.storage.memoryFraction? Thanks Best Regards On Fri, Jan 22, 2016 at 3:40 AM, Arun Luthra wrote: > WARN MemoryStore: Not enough space to cache broadcast_4 in memory! > (computed 60.2 MB so far) > WARN
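
For reference, a hedged sketch of those two settings (the fraction shown is just the pre-1.6 default):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.memory.useLegacyMode", "true")    // Spark 1.6+: fall back to the pre-1.6 memory manager
      .set("spark.storage.memoryFraction", "0.6")   // only honored in legacy mode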

Re: NPE from sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply?

2016-01-27 Thread Jacek Laskowski
Hi Michael, Thanks for the prompt response! Do you have any idea where to start (the code that leads to the issue is so...cough...terrible [1] that it's hard to guess where to start from)? I'll think about the test case to reproduce. [1] I'm the co-author :) Pozdrawiam, Jacek Jacek Laskowski

TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
Dears, I keep getting below exception when using Spark 1.6.0 on top of Tachyon 0.8.2. Tachyon is 93% used and configured as CACHE_THROUGH. Any suggestions will be appreciated, thanks! = Exception in thread "main" org.apache.spark.SparkException: Job aborted

spark.kryo.classesToRegister

2016-01-27 Thread amit tewari
This is what I have added in my code: rdd.persist(StorageLevel.MEMORY_ONLY_SER()); conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); Do I compulsorily need to do anything via spark.kryo.classesToRegister? Or is the above code sufficient to achieve a performance gain

Re: NA value handling in sparkR

2016-01-27 Thread Devesh Raj Singh
Hi, While dealing with missing values with R and SparkR I observed the following. Please tell me if I am right or wrong? Missing values in native R are represented with a logical constant-NA. SparkR DataFrames represents missing values with NULL. If you use createDataFrame() to turn a local R

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
BTW. The tachyon worker log says following: 2015-12-27 01:33:44,599 ERROR WORKER_LOGGER (WorkerBlockMasterClient.java:getId) - java.net.SocketException: Connection reset org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
BTW. At the end of the log, I also find a lot of errors like below: = 2016-01-27 11:47:18,515 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run) - Error occurred during processing of message. java.lang.NullPointerException at

Re: Generate Amplab queries set

2016-01-27 Thread Akhil Das
Have a look at the TPC-H queries; I found this repository with the queries: https://github.com/ssavvides/tpch-spark Thanks Best Regards On Fri, Jan 22, 2016 at 1:35 AM, sara mustafa wrote: > Hi, > I have downloaded the Amplab benchmark dataset from >

Re: Spark partition size tuning

2016-01-27 Thread Jia Zou
Hi, Gene, Thanks for your suggestion. However, even if I set tachyon.user.block.size.bytes=134217728, and I can see that from the web console, the files that I load to Tachyon via copyToLocal, still has 512MB block size. Do you have more suggestions? Best Regards, Jia On Tue, Jan 26, 2016 at

ZlibFactor warning

2016-01-27 Thread Eli Super
Hi, I'm running Spark locally on a Windows 2012 R2 server, no Hadoop installed. I'm getting the following warning: WARN ZlibFactory: Failed to load/initialize native-zlib library. Is it something to worry about? Thanks!

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-27 Thread Gourav Sengupta
Hi, It may be interesting to see this. Can you please create a hivecontext (using standard AWS Spark stack on EMR 4.0) and create a table to read the avro file and read data into a dataframe using hivecontext sql? Please let me know if i can be of any help with this. Regards, Gourav On Wed,

Re: Spark SQL . How to enlarge output rows ?

2016-01-27 Thread Akhil Das
Why would you want to print all rows? You can try the following: sqlContext.sql("select day_time from my_table limit 10").collect().foreach(println) Thanks Best Regards On Sun, Jan 24, 2016 at 5:58 PM, Eli Super wrote: > Unfortunately still getting error when use

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-27 Thread Erisa Dervishi
Hi, I think I have the same issue mentioned here: https://issues.apache.org/jira/browse/SPARK-8898 I tried to run the job with 1 core and it didn't hang anymore. I can live with that for now, but any suggestions are welcome. Erisa On Tue, Jan 26, 2016 at 4:51 PM, Erisa Dervishi

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
BTW, the error happens when configuring Spark to read the input file from Tachyon like the following: /home/ubuntu/spark-1.6.0/bin/spark-submit --properties-file /home/ubuntu/HiBench/report/kmeans/spark/java/conf/sparkbench/spark.conf --class org.apache.spark.examples.mllib.JavaKMeans --master spark://ip

Re: JSON to SQL

2016-01-27 Thread Al Pivonka
More detail is needed. Can you provide some context for the use case? On Wed, Jan 27, 2016 at 8:33 AM, Andrés Ivaldi wrote: > Hello, I'm trying to save a JSON file into a SQL table. > > If I try to do this directly, an IllegalArgumentException is raised; I > suppose this is

Re: JSON to SQL

2016-01-27 Thread Andrés Ivaldi
Sure. The job is like an ETL, but without an interface, so I decide the rules for how the JSON will be saved into a SQL table. I need to flatten the hierarchies where possible (in the case of lists, flatten them as well); nested objects won't be processed for now. For example: {"a":1, "b":[2,3], "c":"Field", "d":[4,5,6,7,8]}
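
A hedged Scala sketch against the sample document above (only the field names come from it; the rest is illustrative):

    import org.apache.spark.sql.functions.explode

    val df = sqlContext.read.json("/path/to/sample.json")
    // One output row per element of the array column "b"; scalar fields pass through unchanged
    val flat = df.select(df("a"), explode(df("b")).as("b"), df("c"))
    // flat can then be written to the relational target, e.g. with the JDBC writer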

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2016-01-27 Thread Andres.Fernandez
So far, still cannot find a way of running a small Scala script right after executing the shell, and get the shell to remain open. Is there a way of doing this? Feels like a simple/naive question but really couldn’t find an answer. From: Fernandez, Andres Sent: Tuesday, January 26, 2016 2:53 PM
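
One commonly used workaround (hedged; exact behaviour depends on the Spark/Scala version) is the REPL's -i flag, which runs a script first and then leaves the shell open for interactive use:

    # init.scala is any small Scala script; the shell remains open after it runs
    spark-shell -i init.scala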

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-27 Thread David Brooks
Hi Ram, Yes, I completely agree. An exception is a poor way to handle this case, and training on a dataset with only zero labels and no one labels should simply work without exceptions. Fortunately, it looks like someone else has recently patched the problem with LogisticRegression:

JSON to SQL

2016-01-27 Thread Andrés Ivaldi
Hello, I'm trying to save a JSON file into a SQL table. If I try to do this directly, an IllegalArgumentException is raised; I suppose this is because JSON has a hierarchical structure, is that correct? If that is the problem, how can I flatten the JSON structure? The JSON structure to be

Re: spark streaming input rate strange

2016-01-27 Thread Akhil Das
How are you verifying the data dropping? Can you send 10k, 20k events and write the same to an output location from spark streaming and verify it? If you are finding a data mismatch then its a problem with your MulticastSocket implementation. Thanks Best Regards On Fri, Jan 22, 2016 at 5:44 PM,

Re: How to send a file to database using spark streaming

2016-01-27 Thread Akhil Das
This is a good start https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md Thanks Best Regards On Sat, Jan 23, 2016 at 12:19 PM, Sree Eedupuganti wrote: > New to Spark Streaming. My question is i want to load the XML files to > database

Re: Debug what is replication Level of which RDD

2016-01-27 Thread Akhil Das
How many RDDs are you persisting? If its 2, then you can verify it by disabling the persist for one of them and from the UI you can see which one of mappedRDD/shuffledRDD. Thanks Best Regards On Sun, Jan 24, 2016 at 3:25 AM, gaurav sharma wrote: > Hi All, > > I have

Having issue with Spark SQL JDBC on hive table !!!

2016-01-27 Thread @Sanjiv Singh
Hi All, I have configured Spark to query a Hive table. I run the Thrift JDBC/ODBC server using the commands below:

    cd $SPARK_HOME
    ./sbin/start-thriftserver.sh --master spark://myhost:7077 --hiveconf hive.server2.thrift.bind.host=myhost --hiveconf hive.server2.thrift.port=

and I am also able to

help with enabling spark dynamic allocation

2016-01-27 Thread varuni gang
Hi, As per the Spark documentation for Dynamic Resource Allocation, I did the following to enable the shuffle/dynamic allocation service: A) Added the following lines to "spark-defaults.conf", enabling dynamic resource allocation and the shuffle service: spark.dynamicAllocation.enabled=true
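
For reference, a hedged sketch of the properties typically involved (values are illustrative; the external shuffle service must also be started on each NodeManager):

    spark.dynamicAllocation.enabled=true
    spark.shuffle.service.enabled=true
    spark.dynamicAllocation.minExecutors=2
    spark.dynamicAllocation.maxExecutors=20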