how to create custom data source?

2015-06-24 Thread 诺铁
Hi, I want to use Spark to analyze source code :) Since code has dependencies between lines, it's not possible to just treat it as lines. So I am considering providing my own data source for source code, but there isn't much documentation about the data source API. Where can I learn how to do this?

Re: Spark launching without all of the requested YARN resources

2015-06-24 Thread Steve Loughran
On 24 Jun 2015, at 05:55, canan chen ccn...@gmail.com wrote: Why do you want it to wait until all the resources are ready? Making it start as early as possible should make it complete earlier and increase the utilization of resources. On Tue, Jun 23, 2015 at 10:34 PM, Arun

Re: Worker is KILLED for no reason

2015-06-24 Thread Demi Ben-Ari
Hi, I've opened a bug on the Spark project's JIRA: https://issues.apache.org/jira/browse/SPARK-8557 Would really appreciate some insights on the issue. *It's strange that no one else has encountered the problem.* Have a great day! On Mon, Jun 15, 2015 at 12:03 PM, nizang ni...@windward.eu

Re: Velox Model Server

2015-06-24 Thread Debasish Das
Model sizes are in the 10m x rank and 100k x rank range. For recommendation/topic modeling I can run a batch recommendAll and then keep serving the model from a distributed cache, but then I can't re-predict per user when user feedback makes the current top-k stale. I have to wait for next

Re: Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-24 Thread Akhil Das
Depending on the amount of memory you have, you could allocate 60-80% of it to the Spark worker process. The DataNode doesn't require too much memory. On 23 Jun 2015 21:26, maxdml max...@cs.duke.edu wrote: I'm wondering if there is a real benefit for splitting my memory in two for

How to use KryoSerializer : ClassNotFoundException

2015-06-24 Thread pth001
Hi, I am using Spark 1.4. I wanted to serialize with KryoSerializer, but got a ClassNotFoundException. The configuration and exception are below. When I submitted the job, I also provided --jars mylib.jar, which contains WRFVariableZ. conf.set(spark.serializer,
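
For reference, a minimal sketch of the kind of Kryo setup involved; the WRFVariableZ stand-in below is only a placeholder for the poster's real class, which ships in mylib.jar via --jars:

    import org.apache.spark.{SparkConf, SparkContext}

    // Stand-in for the application class that actually lives in mylib.jar.
    class WRFVariableZ(val name: String) extends Serializable

    val conf = new SparkConf()
      .setAppName("KryoExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Register the class up front; it must also be on the executor
      // classpath, e.g. shipped with --jars mylib.jar at submit time.
      .registerKryoClasses(Array(classOf[WRFVariableZ]))

    val sc = new SparkContext(conf)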

Re: Velox Model Server

2015-06-24 Thread Nick Pentreath
Ok. My view is that with only 100k items, you are better off serving item vectors in memory, i.e. store all item vectors in memory and compute the user * item score on demand. In most applications only a small proportion of users are active, so really you don't need all 10m user vectors in memory.
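
A minimal sketch of what on-demand scoring against in-memory item vectors could look like (names and shapes are illustrative only, not Velox or Oryx APIs):

    // Score one user against all item vectors and return the top k.
    def topK(userVector: Array[Double],
             itemVectors: Map[Long, Array[Double]],
             k: Int): Seq[(Long, Double)] = {
      def dot(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => x * y }.sum
      itemVectors.toSeq
        .map { case (itemId, vec) => (itemId, dot(userVector, vec)) }
        .sortBy(-_._2)
        .take(k)
    }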

Re: Velox Model Server

2015-06-24 Thread Sean Owen
On Wed, Jun 24, 2015 at 12:02 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Oryx does almost the same but Oryx1 kept all user and item vectors in memory (though I am not sure about whether Oryx2 still stores all user and item vectors in memory or partitions in some way). (Yes, this is a

mllib from sparkR

2015-06-24 Thread escardovi
Hi, I was wondering if it is possible to use MLlib functions inside SparkR, as outlined at the Spark Summit East 2015 Warmup meetup: http://www.meetup.com/Spark-NYC/events/220850389/ Are there any available examples? Thank you! Elena -- View this message in context:

RE: Spark Streaming: limit number of nodes

2015-06-24 Thread Evo Eftimov
Ok, so you are running Spark in Standalone Mode then. Then for every Worker process on every node (you can run more than one Worker per node) you will have an Executor waiting for jobs …. As far as I am concerned, I think there are only two ways to achieve what you need: 1.

Question - writing data to Cassandra to Spark gives a strange error message

2015-06-24 Thread Koen Vantomme
Hello, I'm trying to write data from Spark to Cassandra. Reading data from Cassandra is ok, but writing gives a strange error. Exception in thread main scala.ScalaReflectionException: <none> is not a term at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:259) The
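
For context, the usual write path with the DataStax spark-cassandra-connector looks roughly like the sketch below; the keyspace, table, and columns are placeholders, and this is not a fix for the reflection error itself:

    import com.datastax.spark.connector._

    case class Reading(sensorId: String, ts: Long, value: Double)

    // sc is an existing SparkContext configured with spark.cassandra.connection.host
    val readings = sc.parallelize(Seq(Reading("sensor-1", 1L, 0.5)))
    readings.saveToCassandra("my_keyspace", "readings",
      SomeColumns("sensor_id", "ts", "value"))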

Spark stream test throw org.apache.spark.SparkException: Task not serializable when execute in spark shell

2015-06-24 Thread yuemeng (A)
Hi all, here are two examples: one throws Task not serializable when executed in the spark shell, the other one is ok. I am very puzzled. Can anyone explain what is different about these two pieces of code and why the other one is ok? 1. The one which throws Task not serializable: import org.apache.spark._ import

Re: Spark Streaming: limit number of nodes

2015-06-24 Thread Wojciech Pituła
Ok, thanks. I have 1 worker process on each machine but I would like to run my app on only 3 of them. Is it possible? On Wed, 24.06.2015 at 11:44, Evo Eftimov evo.efti...@isecc.com wrote: There is no direct one-to-one mapping between Executor and Node. Executor is simply the spark

Re: Kafka createDirectStream ​issue

2015-06-24 Thread syepes
Hello, thanks for all the help on resolving this issue, especially to Cody who guided me to the solution. For others facing similar issues: basically the issue was that I was running Spark Streaming jobs from the spark-shell, and this is not supported. Running the same job through spark-submit

RE: Spark Streaming: limit number of nodes

2015-06-24 Thread Evo Eftimov
There is no direct one-to-one mapping between Executor and Node. Executor is simply the spark framework term for a JVM instance with some spark framework system code running in it. A node is a physical server machine. You can have more than one JVM per node, and vice versa you can
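
In standalone mode the closest knob is the total-core cap, sketched below; it limits how many cores (and hence, indirectly, how many workers) the application occupies, although Spark does not guarantee which nodes the executors land on:

    import org.apache.spark.SparkConf

    // With, say, 4 cores per worker, capping the app at 12 cores limits it to
    // executors on roughly 3 workers.
    val conf = new SparkConf()
      .setAppName("LimitedApp")
      .set("spark.cores.max", "12")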

Parquet problems

2015-06-24 Thread Anders Arpteg
When reading large (and many) datasets with the Spark 1.4.0 DataFrames parquet reader (the org.apache.spark.sql.parquet format), the following exceptions are thrown: Exception in thread task-result-getter-0 Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread

How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

2015-06-24 Thread Gustavo Arjones
Hi all, I am using the new Apache Spark 1.4.0 DataFrames API to extract information from Twitter's Status JSON, mostly focused on the Entities Object https://dev.twitter.com/overview/api/entities - the relevant part for this question is shown below: { ... ... entities: {

Re: EOFException using KryoSerializer

2015-06-24 Thread Jim Carroll
I finally got back to this and I just wanted to let anyone that runs into this know that the problem is a kryo version issue. Spark (at least 1.4.0) depends on Kryo 2.21 while my client had 2.24.0 on the classpath. Changing it to 2.21 fixed the problem. -- View this message in context:

Re: how to increase parallelism ?

2015-06-24 Thread ๏̯͡๏
What that did was run a repartition stage with 174 tasks AND the actual .filter.map stage with 500 tasks. It actually doubled the stages. On Wed, Jun 24, 2015 at 12:01 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Hi Deepak, Parallelism is controlled by the

Re: how to increase parallelism ?

2015-06-24 Thread Silvio Fiorito
Yes, it will introduce a shuffle stage in order to perform the repartitioning. So it’s more useful if you’re planning to do many downstream transformations for which you need the increased parallelism. Is this a dataset from HDFS? From: ÐΞ€ρ@Ҝ (๏̯͡๏) Date: Wednesday, June 24, 2015 at 6:11 PM

Re: WorkFlow Processing - Spark

2015-06-24 Thread ayan guha
What kind of custom logic? On 25 Jun 2015 01:33, Ashish Soni asoni.le...@gmail.com wrote: Hi All , We are looking to use spark as our stream processing framework and it would be helpful if experts can weigh if we made a right choice given below requirement Given a stream of data we need to

Re: Understanding accumulator during transformations

2015-06-24 Thread Wei Zhou
Hi Burak, thanks for your quick reply. I guess what confuses me is that the accumulator won't be updated until an action is invoked, due to laziness, so a transformation such as a map won't even update the accumulator on its own. Then how would a restarted transformation end up updating the accumulator more than

Aggregating metrics using Cassandra and Spark streaming

2015-06-24 Thread Mike Trienis
Hello, I'd like to understand how other people have been aggregating metrics using Spark Streaming and a Cassandra database. Currently I have designed some data models that will store the rolled-up metrics. There are two models that I am considering: CREATE TABLE rollup_using_counters (

Re: WorkFlow Processing - Spark

2015-06-24 Thread asoni . learn
Any custom script (Python, Java, or Scala). Thanks, Ashish On Jun 24, 2015, at 4:39 PM, ayan guha guha.a...@gmail.com wrote: What kind of custom logic? On 25 Jun 2015 01:33, Ashish Soni asoni.le...@gmail.com wrote: Hi All , We are looking to use spark as our stream processing

How to Map and Reduce in sparkR

2015-06-24 Thread Wei Zhou
Does anyone know whether SparkR supports map and reduce operations as RDD transformations? Thanks in advance. Best, Wei

Re: Understanding accumulator during transformations

2015-06-24 Thread Wei Zhou
Hi Burak, it makes sense; it boils down to whether any action happens after the transformations, then. Thanks for your answers. Best, Wei 2015-06-24 15:06 GMT-07:00 Burak Yavuz brk...@gmail.com: Hi Wei, During the action, all the transformations before it will occur in order leading up to the action.

Re: Understanding accumulator during transformations

2015-06-24 Thread Burak Yavuz
Hi Wei, for example, when a straggler executor gets killed in the middle of a map operation and its task is restarted on a different instance, the accumulator will be updated more than once. Best, Burak On Wed, Jun 24, 2015 at 1:08 PM, Wei Zhou zhweisop...@gmail.com wrote: Quoting from Spark

Re: Understanding accumulator during transformations

2015-06-24 Thread Burak Yavuz
Hi Wei, during the action, all the transformations before it will occur in order, leading up to the action. If you have an accumulator in any of these transformations, then you won't get exactly-once semantics, because the transformation may be restarted elsewhere. Best, Burak On Wed, Jun 24,
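
A small sketch of the distinction, using the Spark 1.x accumulator API:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("AccExample").setMaster("local[2]"))

    val inTransformation = sc.accumulator(0, "updated in map")
    val inAction         = sc.accumulator(0, "updated in foreach")

    val data = sc.parallelize(1 to 100, 4)

    // Update inside a transformation: if a task is re-run (straggler re-launch,
    // lost executor, or the RDD recomputed for another action), this increment
    // can be applied more than once.
    val doubled = data.map { x => inTransformation += 1; x * 2 }

    // Update inside an action: each task's update is applied exactly once,
    // even if the task is retried.
    doubled.foreach { _ => inAction += 1 }

    println(s"map: ${inTransformation.value}, foreach: ${inAction.value}")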

[sparksql] sparse floating point data compression in sparksql cache

2015-06-24 Thread Nikita Dolgov
When my 22M Parquet test file ended up taking 3G when cached in-memory I looked closer at how column compression works in 1.4.0. My test dataset was 1,000 columns * 800,000 rows of mostly empty floating point columns with a few dense long columns. I was surprised to see that no real

Re: [sparksql] sparse floating point data compression in sparksql cache

2015-06-24 Thread Michael Armbrust
Have you considered instead using the MLlib SparseVector type (which is supported in Spark SQL)? On Wed, Jun 24, 2015 at 1:31 PM, Nikita Dolgov n...@beckon.com wrote: When my 22M Parquet test file ended up taking 3G when cached in-memory I looked closer at how column compression works in
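
A minimal sketch of the suggestion, with made-up indices for a mostly-empty 1,000-element row:

    import org.apache.spark.mllib.linalg.Vectors

    // Only the non-zero entries are stored: (index, value) pairs plus the size.
    val sparseRow = Vectors.sparse(1000, Seq((3, 1.5), (742, 0.25)))

    // The dense equivalent would materialise all 1,000 doubles.
    val denseRow = Vectors.dense(Array.fill(1000)(0.0))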

How to run kmeans.py Spark example in yarn-cluster ?

2015-06-24 Thread Elkhan Dadashov
Hi all, I'm trying to run the kmeans.py Spark example in yarn-cluster mode. I'm using Spark 1.4.0. I'm passing numpy-1.9.2.zip with the --py-files flag. Here is the command I'm trying to execute, but it fails: ./bin/spark-submit --master yarn-cluster --verbose --py-files

HiveContext /Spark much slower than Hive

2015-06-24 Thread afarahat
I have a simple HQL query (below). In Hive it takes maybe 10 minutes to complete. When I do this with Spark it seems to take forever. The table is partitioned by datestamp. I am using Spark 1.3.1. How can I tune/optimize this? Here is the query: tumblruser=hiveCtx.sql( select s_mobile_id, receive_time

Re: Does HiveContext connect to HiveServer2?

2015-06-24 Thread Nitin kak
Hi Marcelo, the issue does not happen while connecting to the Hive metastore; that works fine. It seems that HiveContext only uses the Hive CLI to execute the queries, while HiveServer2 does not support it. I don't think you can specify any configuration in hive-site.xml which can make it connect to

Re:

2015-06-24 Thread Akhil Das
Can you look a bit more into the error logs? It could be getting killed because of OOM, etc. One thing you can try is to set spark.shuffle.blockTransferService to nio instead of netty. Thanks Best Regards On Wed, Jun 24, 2015 at 5:46 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have a Spark job

Re: Can Spark1.4 work with CDH4.6

2015-06-24 Thread Akhil Das
Can you add those jars to the SPARK_CLASSPATH and give it a try? Thanks Best Regards On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I have been using Spark against an external metastore service which runs Hive with CDH 4.6. In Spark 1.2, I was

Re: kafka spark streaming with mesos

2015-06-24 Thread Akhil Das
A screenshot of your framework running would also be helpful. How many cores does it have? Did you try running it in coarse-grained mode? Try adding these to the conf: sparkConf.set("spark.mesos.coarse", "true") sparkConf.set("spark.cores.max", "2") Thanks Best Regards On Wed, Jun 24, 2015 at 1:35 AM,

Loss of data due to congestion

2015-06-24 Thread anshu shukla
How does Spark guarantee that no RDD will fail / be lost during its life cycle? Is there something like acks in Storm, or does it do this by default? -- Thanks Regards, Anshu Shukla

Re: Nested DataFrame(SchemaRDD)

2015-06-24 Thread Richard Catlin
Michael, I have two DataFrames: a users DF and an investments DF. The investments DF has a column that matches the user's id. I would like to nest the collection of investments for each user and save to a Parquet file. Is there a straightforward way to do this? Thanks. Richard Catlin On

Re: Loss of data due to congestion

2015-06-24 Thread ayan guha
Can you elaborate a little more? Are you talking about the receiver or streaming? On 24 Jun 2015 23:18, anshu shukla anshushuk...@gmail.com wrote: How spark guarantees that no RDD will fail /lost during its life cycle . Is there something like ask in storm or its does it by default . -- Thanks

Re: Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-24 Thread Sean Owen
You didn't provide any error. You're compiling vs Hive 1.1 here and that is the problem; it has nothing to do with CDH. On Wed, Jun 24, 2015, 10:15 PM Aaron aarongm...@gmail.com wrote: I was curious if anyone was able to get CDH 5.4.1 or 5.4.2 compiling with the v1.4.0 tag out of git?

Re: Spark ec2 cluster lost worker

2015-06-24 Thread Kelly, Jonathan
Yeah, sorry, I didn't really answer your question due to my bias for EMR. =P Unfortunately, also due to my bias, I have not actually tried using a straight EC2 cluster as opposed to Spark on EMR. I'm not sure how the slave nodes get set up when running Spark on EC2, but I would imagine that it

Re: which mllib algorithm for large multi-class classification?

2015-06-24 Thread Danny Linden
Hi, here is the stack trace, thanks for any help: 15/06/24 23:15:26 INFO DAGScheduler: Submitting ShuffleMapStage 19 (MapPartitionsRDD[49] at treeAggregate at LBFGS.scala:218), which has no missing parents [error] (dag-scheduler-event-loop) java.lang.OutOfMemoryError: Requested array size exceeds

Re: Spark ec2 cluster lost worker

2015-06-24 Thread Kelly, Jonathan
Just curious, would you be able to use Spark on EMR rather than on EC2? Spark on EMR will handle lost nodes for you, and it will let you scale your cluster up and down or clone a cluster (its config, that is, not the data stored in HDFS), among other things. We also recently announced official

Re: java.lang.OutOfMemoryError: PermGen space

2015-06-24 Thread Srikanth
That worked, thanks! I wonder what changed in 1.4 to cause this. It wouldn't work with anything less than 256m for a simple piece of code; 1.3.1 used to work with the default (64m, I think). Srikanth On Wed, Jun 24, 2015 at 12:47 PM, Roberto Coluccio roberto.coluc...@gmail.com wrote: Did you try to

Re: how to increase parallelism ?

2015-06-24 Thread ๏̯͡๏
Yes, these data sets are in HDFS. Earlier that task completed in 25 mins; now it's 15 + 20. On Wed, Jun 24, 2015 at 3:16 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Yes, it will introduce a shuffle stage in order to perform the repartitioning. So it's more useful if you're

Spark ec2 cluster lost worker

2015-06-24 Thread anny9699
Hi, according to the Spark UI, one worker is lost after a failed job. It is not a lost-executor error; rather, the UI now only shows 8 workers (I have 9 workers). However, the EC2 console shows the machine is running with no check alarms. So I am confused about how I could reconnect the lost

Re: Spark ec2 cluster lost worker

2015-06-24 Thread Anny Chen
Hi Jonathan, thanks for this information! I will take a look into it. However, is there a way to reconnect the lost node? Or is there nothing I can do to get back the lost worker? Thanks! Anny On Wed, Jun 24, 2015 at 6:06 PM, Kelly, Jonathan jonat...@amazon.com wrote: Just curious, would

Nesting DataFrames and saving to Parquet

2015-06-24 Thread Richard Catlin
I have two DataFrames: a users DF and an investments DF. The investments DF has a column that matches the user's id. I would like to nest the collection of investments for each user and save to a Parquet file. Is there a straightforward way to do this? Thanks. Richard Catlin

Re: How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

2015-06-24 Thread Yin Huai
The function accepted by explode is f: Row => TraversableOnce[A]. It seems user_mentions is an array of structs. So, can you change your pattern matching to the following? case Row(rows: Seq[_]) => rows.asInstanceOf[Seq[Row]].map(elem => ...) On Wed, Jun 24, 2015 at 5:27 AM, Gustavo Arjones

Re: Velox Model Server

2015-06-24 Thread Debasish Das
Thanks Nick and Sean for the great suggestions... Since you guys have already hit these issues before, I think it would be great if we can add the learnings to Spark Job Server and enhance it for the community. Nick, do you see any major issues in using Spray over Scalatra? Looks like the Model Server API

Re: Spark stream test throw org.apache.spark.SparkException: Task not serializable when execute in spark shell

2015-06-24 Thread Yana Kadiyska
I can't tell immediately, but you might be able to get more info with the hint provided here: http://stackoverflow.com/questions/27980781/spark-task-not-serializable-with-simple-accumulator (short version, set -Dsun.io.serialization.extendedDebugInfo=true) Also, unless you're simplifying your
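
One way to wire that flag in, sketched below; the driver-side flag has to be given at launch rather than in the SparkConf, since the driver JVM is already running by then:

    import org.apache.spark.SparkConf

    // Turn on extended serialization debug info for the executors. For the
    // driver, pass the same flag at launch, e.g.
    //   spark-shell --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true"
    val conf = new SparkConf()
      .setAppName("SerializationDebug")
      .set("spark.executor.extraJavaOptions", "-Dsun.io.serialization.extendedDebugInfo=true")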

java.lang.OutOfMemoryError: PermGen space

2015-06-24 Thread stati
Hello, I moved from 1.3.1 to 1.4.0 and started receiving java.lang.OutOfMemoryError: PermGen space when I use spark-shell. The same Scala code works fine in the 1.3.1 spark-shell. I was loading the same set of external JARs and have the same imports as in 1.3.1. I tried increasing the perm size to 256m. I still got

SparkR parallelize not found with 1.4.1?

2015-06-24 Thread Felix C
Hi, it must be something very straightforward... Not working: parallelize(sc) Error: could not find function parallelize Working: df <- createDataFrame(sqlContext, localDF) What did I miss? Thanks

Re: WorkFlow Processing - Spark

2015-06-24 Thread ayan guha
As long as the logic can be run in parallel, yes. You should not, however, run any logic in the driver; all logic should run in the executors. On 25 Jun 2015 07:58, asoni.le...@gmail.com wrote: Any custom script ( python or java or scala ) Thanks , Ashish On Jun 24, 2015, at 4:39 PM, ayan guha

Re: Loss of data due to congestion

2015-06-24 Thread anshu shukla
Thanks, I am talking about streaming. On 25 Jun 2015 5:37 am, ayan guha guha.a...@gmail.com wrote: Can you elaborate little more? Are you talking about receiver or streaming? On 24 Jun 2015 23:18, anshu shukla anshushuk...@gmail.com wrote: How spark guarantees that no RDD will fail /lost

Re: java.lang.OutOfMemoryError: PermGen space

2015-06-24 Thread Roberto Coluccio
Did you try passing it with --driver-java-options -XX:MaxPermSize=256m as a spark-shell input argument? Roberto On Wed, Jun 24, 2015 at 5:57 PM, stati srikanth...@gmail.com wrote: Hello, I moved from 1.3.1 to 1.4.0 and started receiving java.lang.OutOfMemoryError: PermGen space when I

dateTime functionality

2015-06-24 Thread hbutani
Hi, I have just uploaded a spark package for dateTime expressions: https://github.com/SparklineData/spark-datetime. It exposes functions on DateTime, Period arithmetic and Intervals in sql and provides a simple dsl to build catalyst expressions about dateTime. A date StringContext lets you embed

bugs in Spark PageRank implementation

2015-06-24 Thread Kelly, Terence P (HP Labs Researcher)
Hi, Colleagues and I have found that the PageRank implementation bundled with Spark is incorrect in several ways. The code in question is in Apache Spark 1.2 distribution's examples directory, called SparkPageRank.scala. Consider the example graph presented in the colorful figure on the

Debugging Apache Spark clustered application from Eclipse

2015-06-24 Thread nitinkalra2000
I am trying to debug a Spark application running from Eclipse in a clustered/distributed environment but am not able to succeed. The application is Java based and I am running it through Eclipse. Configuration of Spark for master/worker is provided through Java only. Though I can debug the code on the driver

Re: bugs in Spark PageRank implementation

2015-06-24 Thread Tarek Auel
Hi Terence, which implementation are you using? I tested it and the results look very good:
id: result value (percentage) --- percentage (wikipedia)
2: 3.5658816369034536 (38.43986817970977 %), 38.4%
3: 3.1809909923039688 (34.29078328331496 %), 34.3%
5: 0.7503491964913347

Parsing a tsv file with key value pairs

2015-06-24 Thread Ravikant Dindokar
Hi Spark users, I am new to Spark so forgive me for asking a basic question. I'm trying to import my TSV file into Spark. This file has a key and a value separated by a \t on each line. I want to import this file as a dictionary of key-value pairs in Spark. I came across this code to do the same for CSV
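
A minimal sketch of one way to do this for a tab-separated file; the path is a placeholder, and collecting to a driver-side map only makes sense for small files:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("TsvToPairs").setMaster("local[2]"))

    // One (key, value) pair per line, split on the first tab.
    val pairs = sc.textFile("hdfs:///path/to/data.tsv").map { line =>
      val Array(key, value) = line.split("\t", 2)
      (key, value)
    }

    // Driver-side dictionary, only for data small enough to collect.
    val asMap: Map[String, String] = pairs.collect().toMap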

WorkFlow Processing - Spark

2015-06-24 Thread Ashish Soni
Hi All, we are looking to use Spark as our stream processing framework, and it would be helpful if experts can weigh in on whether we made the right choice given the requirement below. Given a stream of data, we need to take those events through multiple stages (pipeline processing), and in those stages the customer will

Re: Spark launching without all of the requested YARN resources

2015-06-24 Thread Sandy Ryza
Hi Arun, You can achieve this by setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really high number and spark.scheduler.minRegisteredResourcesRatio to 1.0. -Sandy On Wed, Jun 24, 2015 at 2:21 AM, Steve Loughran ste...@hortonworks.com wrote: On 24 Jun 2015, at 05:55, canan
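
A sketch of those two settings; the waiting time below is an arbitrary illustrative value (since Spark 1.4, time values may carry a unit suffix):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
      .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "600s")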

Exception in thread main java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-06-24 Thread maxdml
Basically, here's a dump of the SO question I opened (http://stackoverflow.com/questions/31033724/spark-1-4-0-java-lang-nosuchmethoderror-com-google-common-base-stopwatch-elapse) I'm using spark 1.4.0 and when running the Scala SparkPageRank example

Re:

2015-06-24 Thread ๏̯͡๏
It's running now. On Wed, Jun 24, 2015 at 10:45 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Now running with *--num-executors 9973 --driver-memory 14g --driver-java-options -XX:MaxPermSize=512M -Xmx4096M -Xms4096M --executor-memory 14g --executor-cores 1* On Wed, Jun 24, 2015 at 10:34

Re: When to use underlying data management layer versus standalone Spark?

2015-06-24 Thread Sandy Ryza
Hi Michael, Spark itself is an execution engine, not a storage system. While it has facilities for caching data in memory, think about these the way you would think about a process on a single machine leveraging memory - the source data needs to be stored somewhere, and you need to be able to

Re:

2015-06-24 Thread Akhil Das
Cool. :) On 24 Jun 2015 23:44, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Its running now. On Wed, Jun 24, 2015 at 10:45 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Now running with *--num-executors 9973 --driver-memory 14g --driver-java-options -XX:MaxPermSize=512M -Xmx4096M -Xms4096M

Re:

2015-06-24 Thread ๏̯͡๏
I see this:
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Arrays.java:2694)
    at java.lang.String.<init>(String.java:203)
    at java.lang.StringBuilder.toString(StringBuilder.java:405)
    at java.io.UnixFileSystem.resolve(UnixFileSystem.java:108)
    at

Re:

2015-06-24 Thread ๏̯͡๏
There are multiple of these:
1) 15/06/24 09:53:37 ERROR executor.Executor: Exception in task 443.0 in stage 3.0 (TID 1767)
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at sun.reflect.GeneratedSerializationConstructorAccessor1327.newInstance(Unknown Source)
    at

how to increase parallelism ?

2015-06-24 Thread ๏̯͡๏
I have a filter.map that triggers 170 tasks. How can I increase it? Code: val viEvents = details.filter(_.get(14).asInstanceOf[Long] != NULL_VALUE).map { vi => (vi.get(14).asInstanceOf[Long], vi) } Deepak

Re:

2015-06-24 Thread ๏̯͡๏
Now running with *--num-executors 9973 --driver-memory 14g --driver-java-options -XX:MaxPermSize=512M -Xmx4096M -Xms4096M --executor-memory 14g --executor-cores 1* On Wed, Jun 24, 2015 at 10:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: There are multiple of these 1) 15/06/24 09:53:37

Spark Python process

2015-06-24 Thread Justin Steigel
I have a spark job that's running on a 10 node cluster and the python process on all the nodes is pegged at 100%. I was wondering what parts of a spark script are run in the python process and which get passed to the Java processes? Is there any documentation on this? Thanks, Justin

Spark SQL incompatible with Apache Sentry(Cloudera bundle)

2015-06-24 Thread nitinkak001
CDH version: 5.3, Spark version: 1.2. I was trying to execute a Hive query from Spark code (using the HiveContext class). It was working fine until we installed Apache Sentry. Now it's giving me a read permission exception: org.apache.hadoop.security.AccessControlException: Permission denied:

Re: Can Spark1.4 work with CDH4.6

2015-06-24 Thread Yana Kadiyska
Thanks, that did seem to make a difference. I am a bit scared of this approach, as Spark itself has a different Guava dependency, but the error does go away this way. On Wed, Jun 24, 2015 at 10:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try to add those jars in the SPARK_CLASSPATH

Re: How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

2015-06-24 Thread Michael Armbrust
Starting in Spark 1.4 there is also an explode that you can use directly from the select clause (much like in HiveQL): import org.apache.spark.sql.functions._ df.select(explode($"entities.user_mentions").as("mention")) Unlike standard HiveQL, you can also include other attributes in the select or
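
Putting it together, a hedged end-to-end sketch; the input path is a placeholder and the field names follow the Twitter entities payload from the original post:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    import sqlContext.implicits._

    val df = sqlContext.read.json("hdfs:///path/to/tweets.json")

    // One row per user mention, keeping the tweet id alongside the exploded struct.
    val mentions = df.select($"id", explode($"entities.user_mentions").as("mention"))
    mentions.select($"id", $"mention.screen_name").show()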

Re:

2015-06-24 Thread ๏̯͡๏
It's taking an hour, and on Hadoop it takes 1h 30m; is there a way to make it run faster? On Wed, Jun 24, 2015 at 11:39 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Cool. :) On 24 Jun 2015 23:44, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Its running now. On Wed, Jun 24, 2015 at 10:45 AM,

Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-24 Thread Aaron
I was curious if anyone was able to get CDH 5.4.1 or 5.4.2 compiling with the v1.4.0 tag out of git? SparkSQL keeps dying on me and I'm not 100% sure why. I modified the pom.xml to make a simple profile to help: <profile> <id>cdh542</id> <properties> <java.version>1.7</java.version>

Re: how to increase parallelism ?

2015-06-24 Thread Silvio Fiorito
Hi Deepak, parallelism is controlled by the number of partitions. In this case, how many partitions are there for the details RDD (likely 170)? You can check by running “details.partitions.length”. If you want to increase parallelism you can do so by repartitioning, increasing the number of
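
Against the snippet from earlier in the thread (details and NULL_VALUE are Deepak's), the repartitioning step would look roughly like:

    println(details.partitions.length)            // e.g. 170, typically one per HDFS block

    val repartitioned = details.repartition(500)  // adds a shuffle stage

    val viEvents = repartitioned
      .filter(_.get(14).asInstanceOf[Long] != NULL_VALUE)
      .map(vi => (vi.get(14).asInstanceOf[Long], vi))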

com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to read chunk

2015-06-24 Thread Piero Cinquegrana
Hello Spark experts, I am facing the following issue. 1) I am converting an org.apache.spark.sql.Row into an org.apache.spark.mllib.linalg.Vector using sparse notation. 2) After the parsing proceeds successfully, I try to look at the result and I get the following error:

Understanding accumulator during transformations

2015-06-24 Thread Wei Zhou
Quoting from the Spark Programming Guide: For accumulator updates performed inside *actions only*, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update