Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-16 Thread Ted Yu
Since both Scala and Java files are involved in the PR, I don't see an easy way around it without building it yourself. Cheers On Wed, Dec 16, 2015 at 10:18 AM, Saiph Kappa wrote: > Exactly, but it's only fixed for the next spark version. Is there any work > around for version

Re: looking for an easier way to count the number of items in a JavaDStream

2015-12-16 Thread Bryan Cutler
To follow up on your other issue: if you are just trying to count elements in a DStream, you can do that without an Accumulator. foreachRDD is meant to be an output action; it does not return anything and it is actually run in the driver program. Because Java (before 8) handles closures a
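
(A minimal Scala sketch of the approach described above, assuming 'lines' is an existing DStream; the names are illustrative, not from the original thread.)

```scala
import org.apache.spark.streaming.dstream.DStream

// Assumed: 'lines' is any DStream[String] created elsewhere.
def countPerBatch(lines: DStream[String]): Unit = {
  // count() yields a DStream whose single-element RDDs hold the per-batch count.
  lines.count().print()

  // Alternatively, foreachRDD runs on the driver; rdd.count() is a normal action.
  lines.foreachRDD { rdd =>
    println(s"Records in this batch: ${rdd.count()}")
  }
}
```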

Scala VS Java VS Python

2015-12-16 Thread Daniel Valdivia
Hello, This is more of a "survey" question for the community, you can reply to me directly so we don't flood the mailing list. I'm having a hard time learning Spark using Python since the API seems to be slightly incomplete, so I'm looking at my options to start doing all my apps in either

Parquet datasource optimization for distinct query

2015-12-16 Thread pnpritchard
I have a parquet file that is partitioned by a column, like shown in http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery. I like this storage technique because the datasource can "push-down" filters on the partitioned column, making some queries a lot faster.
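
(A hedged Scala sketch of the partitioned-Parquet layout described above; the paths, the column name, and the DataFrame 'df' are illustrative assumptions.)

```scala
// Writing partitioned by "date" creates one directory per value, e.g. date=2015-12-16/.
df.write.partitionBy("date").parquet("/data/events")

// Filters on the partition column are pushed down, so only matching directories are read.
val events = sqlContext.read.parquet("/data/events")
events.filter("date = '2015-12-16'").count()

// A distinct() over just the partition column; whether this can be answered from the
// directory listing alone (rather than scanning files) is the optimization asked about.
events.select("date").distinct().show()
```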

count(*) performance in Hive vs Spark DataFrames

2015-12-16 Thread Christopher Brady
I'm having an issue where count(*) returns almost immediately using Hive, but takes over 10 min using DataFrames. The table data is on HDFS in an uncompressed CSV format. How is it possible for Hive to get the count so fast? Is it caching this or putting it in the metastore? Is there anything

Re: Scala VS Java VS Python

2015-12-16 Thread Daniel Lopes
For me Scala is better since Spark is written in Scala, and I like Python because I have always used Python for data science. :) On Wed, Dec 16, 2015 at 5:54 PM, Daniel Valdivia wrote: > Hello, > > This is more of a "survey" question for the community, you can reply to me >

Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
When you re-run the last statement a second time, does it work? Could it be related to https://issues.apache.org/jira/browse/SPARK-12350 ? On 16 December 2015 at 10:39, Ted Yu wrote: > Hi, > I used the following command on a recently refreshed checkout of master > branch: >

Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
Yeah, the same kind of error actually happens in the JIRA. It actually succeeds but a load of exceptions are thrown. Subsequent runs don't produce any errors anymore. On 16 December 2015 at 10:55, Ted Yu wrote: > The first run actually worked. It was the amount of

Re: Can't create UDF through thriftserver, no error reported

2015-12-16 Thread Antonio Piccolboni
Mmmmh, I may have been stuck with some stale lib. A complete reset of the client (where by "complete" I mean beyond what reason would require) solved the problem. I think we can consider this solved unless new evidence appears. Thanks! On Wed, Dec 16, 2015 at 9:16 AM Antonio Piccolboni

Setting the vote rate in a Random Forest in MLlib

2015-12-16 Thread Young, Matthew T
One of our data scientists is interested in using Spark to improve performance in some random forest binary classifications, but isn't getting good enough results from MLlib's implementation of the random forest compared to R's randomforest library with the available parameters. She suggested

File not found error running query in spark-shell

2015-12-16 Thread Ted Yu
Hi, I used the following command on a recently refreshed checkout of master branch: ~/apache-maven-3.3.3/bin/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 -Dhadoop.version=2.7.0 package -DskipTests I was then running simple query in spark-shell: Seq( (83, 0, 38), (26, 0,

Re: HiveContext Self join not reading from cache

2015-12-16 Thread Ted Yu
I did the following exercise in spark-shell ("c" is cached table): scala> sqlContext.sql("select x.b from c x join c y on x.a = y.a").explain == Physical Plan == Project [b#4] +- BroadcastHashJoin [a#3], [a#125], BuildRight :- InMemoryColumnarTableScan [b#4,a#3], InMemoryRelation

Re: Can't create UDF through thriftserver, no error reported

2015-12-16 Thread Antonio Piccolboni
Hi Jeff, the ticket is certainly relevant, thanks for digging it out, but as I said I can repro in 1.6.0-rc2. Will try again just to make sure. On Tue, Dec 15, 2015 at 5:17 PM Jeff Zhang wrote: > It should be resolved by this ticket >

Re: Error while running a job in yarn-client mode

2015-12-16 Thread Ted Yu
Do you mind sharing code snippet so that we have more clue ? Thanks On Wed, Dec 16, 2015 at 8:37 AM, sunil m <260885smanik...@gmail.com> wrote: > Hello Spark experts! > > > I am using spark 1.5.1 and get the following exception while running > sample applications > > Any tips/ hints on how to

Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-16 Thread Saiph Kappa
Exactly, but it's only fixed for the next spark version. Is there any work around for version 1.5.2? On Wed, Dec 16, 2015 at 4:36 PM, Ted Yu wrote: > This seems related: > [SPARK-10123][DEPLOY] Support specifying deploy mode from configuration > > FYI > > On Wed, Dec 16,

Re: File not found error running query in spark-shell

2015-12-16 Thread Ted Yu
The first run actually worked. It was the amount of exceptions preceding the result that surprised me. I want to see if there is a way of getting rid of the exceptions. Thanks On Wed, Dec 16, 2015 at 10:53 AM, Jakob Odersky wrote: > When you re-run the last statement a

HiveContext Self join not reading from cache

2015-12-16 Thread Gourav Sengupta
Hi, This is how the data can be created: 1. TableA : cached() 2. TableB : cached() 3. TableC: TableA inner join TableB cached() 4. TableC join TableC does not take the data from cache but starts reading the data for TableA and TableB from disk. Does this sound like a bug? The self join between

Re: PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-16 Thread Daniel Valdivia
Hi Abhishek, Thanks for your suggestion. I did consider it, but I'm not sure whether to achieve that I'd need to collect() the data first; I don't think it would fit into the driver's memory. Since I'm trying all of this inside the pyspark shell I'm using a small dataset, however the main dataset is

Re: NPE in using AvroKeyValueInputFormat for newAPIHadoopFile

2015-12-16 Thread Igor Berman
Check version compatibility. I think the avro lib should be 1.7.4. Check that no other lib brings in a transitive dependency on a different avro version. On 16 December 2015 at 09:44, Jinyuan Zhou wrote: > Hi, I tried to load avro files in hdfs but keep getting NPE. > I am using

Re: PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-16 Thread abhisheksgumadi
Hi Daniel, yes it will work without the collect method. You just do a map operation on every item of the RDD. Thanks Abhishek S > On 16 Dec 2015, at 18:10, Daniel Valdivia wrote: > > Hi Abhishek, > > Thanks for your suggestion, I did consider it, but I'm not

Re: Preventing an RDD from shuffling

2015-12-16 Thread Igor Berman
IMHO, you should implement your own RDD with Mongo sharding awareness; this RDD will then have a Mongo-aware partitioner, and the incoming data will be partitioned by that partitioner in the join. Not sure if it's a simple task...but you have to get a partitioner into your Mongo RDD. On 16 December 2015

Re: NPE in using AvroKeyValueInputFormat for newAPIHadoopFile

2015-12-16 Thread Ted Yu
From Spark's root pom.xml: 1.7.7 FYI On Wed, Dec 16, 2015 at 3:06 PM, Igor Berman wrote: > check version compatibility > I think avro lib should be 1.7.4 > check that no other lib brings transitive dependency of other avro version > > > On 16 December 2015 at
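
(If the project is built with sbt, one way to pin Avro to the version Spark itself uses is a dependency override; a hedged sketch, with the version taken from Ted's note.)

```scala
// build.sbt -- force every transitive Avro dependency to the version in Spark's root pom.
dependencyOverrides += "org.apache.avro" % "avro" % "1.7.7"
```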

Re: looking for an easier way to count the number of items in a JavaDStream

2015-12-16 Thread Todd Nist
Another possible alternative is to register a StreamingListener and then reference the BatchInfo.numRecords; good example here: https://gist.github.com/akhld/b10dc491aad1a2007183. After registering the listener, simply implement the appropriate "onEvent" method, where onEvent is onBatchStarted,
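
(A minimal sketch of the listener approach in Scala; the class name is mine, not from the linked gist.)

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs the record count of every completed batch.
class RecordCountListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    println(s"Batch processed ${batchCompleted.batchInfo.numRecords} records")
  }
}

// Register it once on the StreamingContext:
// ssc.addStreamingListener(new RecordCountListener)
```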

Re: Scala VS Java VS Python

2015-12-16 Thread Gary Struthers
The “Learning Spark” book has code examples in Scala, Java, and Python; it is an excellent book both for learning Spark and for showing just enough Scala to program Spark. http://shop.oreilly.com/product/0636920028512.do Then compare the api for a

Re: Scala VS Java VS Python

2015-12-16 Thread Stephen Boesch
There are solid reasons to have built Spark on the JVM vs Python. The question for Daniel appears at this point to be Scala vs Java 8. For that there are many comparisons already available, but in the case of working with Spark there is the additional benefit for the Scala side that the core

Re: Kafka - streaming from multiple topics

2015-12-16 Thread jpocalan
Nevermind, I found the answer to my questions. The following Spark configuration property will allow you to process multiple KafkaDirectStreams in parallel: --conf spark.streaming.concurrentJobs=
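
(The same property can also be set programmatically; a hedged sketch, where the value of 2 is purely illustrative.)

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("multi-topic-streams")
  // Illustrative value -- tune to the number of independent streams to run in parallel.
  .set("spark.streaming.concurrentJobs", "2")
```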

Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
For future reference, this should be fixed with PR #10337 ( https://github.com/apache/spark/pull/10337) On 16 December 2015 at 11:01, Jakob Odersky wrote: > Yeah, the same kind of error actually happens in the JIRA. It actually > succeeds but a load of exceptions are thrown.

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-16 Thread Bartłomiej Alberski
First of all, thanks @tdas for looking into my problem. Yes, I checked it separately and it is working fine. For the piece of code below there is not a single exception and values are sent correctly. val reporter = new MyClassReporter(...) reporter.send(...) val out = new

Re: Scala VS Java VS Python

2015-12-16 Thread Darren Govoni
I use Python too. I'm actually surprised it's not the primary language, since it is by far more used in data science than Java and Scala combined. If I had a second choice of scripting language for general apps I'd want Groovy over Scala. Sent from my Verizon Wireless 4G LTE smartphone

Re: Scala VS Java VS Python

2015-12-16 Thread Lan Jiang
For a Spark data science project, Python might be a good choice. However, for Spark Streaming, the Python API is still lagging. For example, for the Kafka no-receiver connector, according to the Spark 1.5.2 doc: "Spark 1.4 added a Python API, but it is not yet at full feature parity". Java does not

RE: ideal number of executors per machine

2015-12-16 Thread Bui, Tri
The article below gives a good idea. http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ Play around with the two configurations (a large number of executors with few cores each, and a small number of executors with many cores each). Calculated values have to be conservative or it will make the

Re: Scala VS Java VS Python

2015-12-16 Thread Veljko Skarich
I use Scala, since I was most familiar with it out of the three languages, when I started using Spark. I would say that learning Scala with no functional programming background is somewhat challenging, but well worth it if you have the time. As others have pointed out, using the REPL and

Re: PySpark Connection reset by peer: socket write error

2015-12-16 Thread Surendran Duraisamy
I came across this issue while running the program on my laptop with a small data set (around 3.5 MB). The code is straightforward, as follows: data = sc.textFile("inputfile.txt") mappedRdd = data.map(mapFunction).cache() model = ALS.train(mappedRdd, 10, 15) ... mapFunction is a simple map

MLlib: Feature Importances API

2015-12-16 Thread Asim Jalis
I wanted to get feature importances for a Random Forest as described in this JIRA: https://issues.apache.org/jira/browse/SPARK-5133 However, I don't see how to call this. I don't see any methods exposed on org.apache.spark.mllib.tree.RandomForest. How can I get featureImportances when
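
(If I recall correctly, the work in SPARK-5133 exposed feature importances in the DataFrame-based spark.ml API rather than on mllib.tree.RandomForest; a hedged Scala sketch, where 'trainingData' is an assumed DataFrame with "label" and "features" columns.)

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(50) // illustrative

val model = rf.fit(trainingData)
// A Vector with one importance weight per feature.
println(model.featureImportances)
```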

spark master process shutdown for timeout

2015-12-16 Thread yaoxiaohua
Hi guys, I have two nodes used as Spark masters, spark1 and spark2 (Spark 1.4.0, Sun JDK 1.7). These days I have found that the spark2 master process may shut down. I found this in the log file: 15/12/17 13:09:58 INFO ClientCnxn: Client session timed out, have not heard from server in

Error getting response from spark driver rest APIs : java.lang.IncompatibleClassChangeError: Implementing class

2015-12-16 Thread ihavethepotential
Hi all, I am trying to get the job details for my spark application using a REST call to the driver API. I am making a GET request to the following URI: /api/v1/applications/socketEs2/jobs But I am getting the following exception: 2015-12-16 19:46:28 qtp1912556493-56 [WARN ] ServletHandler -

Access row column by field name

2015-12-16 Thread Daniel Valdivia
Hi, I'm processing the JSON I have in a text file using DataFrames; however, right now I'm trying to figure out a way to access a certain value within the rows of my data frame if I only know the field name and not the respective field position in the schema. I noticed that row.schema and

Re: Preventing an RDD from shuffling

2015-12-16 Thread Koert Kuipers
A join needs a partitioner, and will shuffle the data as needed for the given partitioner (or if the data is already partitioned it will leave it alone), after which it will proceed with something like a map-side join. If you can specify a partitioner that matches the exact layout of your data
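
(A sketch of the idea in Scala: pre-partition both pair RDDs with the same partitioner so the join reuses that layout instead of shuffling again; 'mongoRdd' and 'incomingRdd' are assumed pair RDDs and the partition count is illustrative.)

```scala
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(32) // illustrative partition count

val left  = mongoRdd.partitionBy(partitioner).cache()
val right = incomingRdd.partitionBy(partitioner)

// Both sides now share the partitioner, so the join does not re-shuffle them.
val joined = left.join(right)
```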

Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-16 Thread Jacek Laskowski
Thanks Mark for the answer! It helps, but still leaves me with a few more questions. If you don't mind, I'd like to ask them. When you said "It can be used, and is used in user code, but it isn't always as straightforward as you might think." were you thinking about the Spark code or

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-16 Thread Joseph Bradley
This method is tested in the Spark 1.5 unit tests, so I'd guess it's a problem with the Parquet dependency. What version of Parquet are you building Spark 1.5 off of? (I'm not that familiar with Parquet issues myself, but hopefully a SQL person can chime in.) On Tue, Dec 15, 2015 at 3:23 PM,

Spark job dying when I submit through oozie

2015-12-16 Thread Scott Gallavan
I am trying to submit a Spark job through Oozie; the job is marked successful, but it does not do anything. I have it working through spark-submit on the command line. The problem may be around creating the Spark context, because I added logging before/after/finally creating the

Re: PySpark Connection reset by peer: socket write error

2015-12-16 Thread Vijay Gharge
Can you elaborate on your problem further? Looking at the error, it looks like you are running on a cluster. Also share the relevant code for better understanding. On Wednesday 16 December 2015, Surendran Duraisamy wrote: > Hi, > > > > I am running ALS to train a data set of around

Re: Adding a UI Servlet Filter

2015-12-16 Thread Ted Yu
Which Spark release are you using? Mind pasting the stack trace? > On Dec 14, 2015, at 11:34 AM, iamknome wrote: > > Hello all, > > I am trying to set up a UI filter for the Web UI and trying to add my > custom auth servlet filter to the worker and master processes. I have

Re: Access row column by field name

2015-12-16 Thread Jeff Zhang
use Row.getAs[String](fieldname) On Thu, Dec 17, 2015 at 10:58 AM, Daniel Valdivia wrote: > Hi, > > I'm processing the json I have in a text file using DataFrames, however > right now I'm trying to figure out a way to access a certain value within > the rows of my data
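
(A small example of the answer above; the field name "city" and the DataFrame 'df' are illustrative.)

```scala
// Each row is an org.apache.spark.sql.Row; getAs looks the field up by name.
val cities = df.rdd.map(row => row.getAs[String]("city").toUpperCase)
cities.take(5).foreach(println)
```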

Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-16 Thread Sean Owen
It does look like it's not actually used. It may simply be there for completeness, to match cancelStage and cancelJobGroup, which are used. I also don't know of a good reason there's no way to kill a whole job. On Wed, Dec 16, 2015 at 1:15 PM, Jacek Laskowski wrote: > Hi, > >
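
(For completeness, the job-group variants mentioned above are usable from user code; a hedged sketch with an illustrative group id.)

```scala
// Tag all work submitted from this thread with a group id...
sc.setJobGroup("nightly-etl", "nightly ETL run", interruptOnCancel = true)
// ... run actions ...

// ...then cancel the whole group from another thread or a management endpoint.
sc.cancelJobGroup("nightly-etl")
```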

Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Jeff Zhang
Oh, you are using S3. As I remember, S3 has performance issues when processing a large number of files. On Wed, Dec 16, 2015 at 7:58 PM, Gourav Sengupta wrote: > The HIVE table has very large number of partitions around 365 * 5 * 10 and > when I say hivemetastore to

Re: PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-16 Thread Abhishek Shivkumar
Hello Daniel, I was wondering if you can write catGroupArr.map(lambda line: create_and_write_file(line)) def create_and_write_file(line): 1. look at the key of line: line[0] 2. open a file with the required file name based on the key 3. iterate through the values of this key,value pair

Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
Hi Jeff, sadly that does not resolve the issue. I am sure that the in-memory mapping to physical file locations can be saved and recovered in Spark. Regards, Gourav Sengupta On Wed, Dec 16, 2015 at 12:13 PM, Jeff Zhang wrote: > oh, you are using S3. As I remember, S3 has

PySpark Connection reset by peer: socket write error

2015-12-16 Thread Surendran Duraisamy
Hi, I am running ALS to train a data set of around 15 lines on my local machine. When I call train I am getting the following exception. print mappedRDDs.count() # this prints the correct RDD count model = ALS.train(mappedRDDs, 10, 15) 15/12/16 18:43:18 ERROR PythonRDD: Python worker exited

Re: ideal number of executors per machine

2015-12-16 Thread Sean Owen
I don't think it has anything to do with using all the cores, since 1 executor can run as many tasks as you like. Yes, you'd want them to request all cores in this case. YARN vs Mesos does not matter here. On Wed, Dec 16, 2015 at 1:58 PM, Michael Segel wrote: > Hmmm. >

Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-16 Thread Ted Yu
This seems related: [SPARK-10123][DEPLOY] Support specifying deploy mode from configuration FYI On Wed, Dec 16, 2015 at 7:31 AM, Saiph Kappa wrote: > Hi, > > I have a client application running on host0 that is launching multiple > drivers on multiple remote standalone

Trying to index document in Solr with Spark and solr-spark library

2015-12-16 Thread Guillermo Ortiz
I'm trying to index documents to Solr from Spark with the solr-spark library. I have created a project with Maven and included all the dependencies when I execute Spark, but I get a ClassNotFoundException. I have checked that the class is in one of the jars that I'm including (solr-solrj-4.10.3.jar). I

Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-16 Thread Saiph Kappa
Hi, I have a client application running on host0 that is launching multiple drivers on multiple remote standalone spark clusters (each cluster is running on a single machine): « ... List("host1", "host2" , "host3").foreach(host => { val sparkConf = new SparkConf() sparkConf.setAppName("App")
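
(A hedged sketch of what the thread is asking for: once the change from SPARK-10123 is available, the deploy mode could be supplied as plain configuration rather than a spark-submit flag; per this thread that is not possible in 1.5.2, and the master URL here is illustrative.)

```scala
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("App")
  .setMaster("spark://host1:7077")
  .set("spark.submit.deployMode", "cluster") // config key introduced by SPARK-10123
```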

Re: Hive error after update from 1.4.1 to 1.5.2

2015-12-16 Thread Bryan Jeffrey
I had a bunch of library dependencies that were still using Scala 2.10 versions. I updated them to 2.11 and everything has worked fine since. On Wed, Dec 16, 2015 at 3:12 AM, Ashwin Sai Shankar wrote: > Hi Bryan, > I see the same issue with 1.5.2, can you pls let me know

WholeTextFile for 8000~ files - problem

2015-12-16 Thread Eran Witkon
Hi, I have about 8K files in about 10 directories on HDFS and I need to add a column to all files with the file name (e.g. file1.txt adds a column with "file1.txt", file2.txt with "file2.txt", etc.). The current approach was to read all files using sc.wholeTextFiles("myPath") and have the file name as
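
(A minimal Scala sketch of the current approach; the path and the separator are placeholders.)

```scala
// wholeTextFiles yields (fileName, fileContent) pairs; append the file name to every line.
val withFileName = sc.wholeTextFiles("hdfs:///myPath/*")
  .flatMap { case (fileName, content) =>
    content.split("\n").map(line => s"$line,$fileName")
  }
```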

Re: Preventing an RDD from shuffling

2015-12-16 Thread PhuDuc Nguyen
There is a way and it's called "map-side-join". To be clear, there is no explicit function call/API to execute a map-side-join. You have to code it using a local/broadcast value combined with the map() function. A caveat for this to work is that one side of the join must be small-ish to exist as a
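
(A hedged Scala sketch of the broadcast-based map-side join described above; 'smallRdd' and 'largeRdd' are assumed pair RDDs, and the small side must fit in driver and executor memory.)

```scala
// Collect the small side to the driver and broadcast it as a lookup map.
val smallMap = sc.broadcast(smallRdd.collectAsMap())

// Join the large side locally inside each task -- no shuffle of largeRdd.
val joined = largeRdd.flatMap { case (k, v) =>
  smallMap.value.get(k).map(other => (k, (v, other)))
}
```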

Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
The HIVE table has a very large number of partitions, around 365 * 5 * 10, and when I ask the hivemetastore to start running queries on it (the ones with .count() or .show()) it takes around 2 hours before the job starts in Spark. On the pyspark screen I can see that it is parsing the S3 locations

SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-16 Thread Jacek Laskowski
Hi, While reviewing Spark code I came across SparkContext.cancelJob. I found no part of Spark using it. Is this a leftover after some refactoring? Why is this part of sc? The reason I'm asking is another question I have after learning about killing a stage in the web UI. I noticed there is a

Re: Trying to index document in Solr with Spark and solr-spark library

2015-12-16 Thread Guillermo Ortiz
I have found a more specific error in an executor. The other error is happening in the driver. I have navigated in ZooKeeper and the collection is created. Anyway, it seems more a problem with Solr than Spark right now. 2015-12-16 16:31:43,923 [Executor task launch worker-1] INFO

Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-16 Thread Mark Hamstra
It can be used, and is used in user code, but it isn't always as straightforward as you might think. This is mostly because a Job often isn't a Job -- or rather it is more than one Job. There are several RDD transformations that aren't lazy, so they end up launching "hidden" Jobs that you may

Re: Spark-shell connecting to Mesos stuck at sched.cpp

2015-12-16 Thread Aaron
Found this thread that talked about it to help understand it better: https://mail-archives.apache.org/mod_mbox/mesos-user/201507.mbox/%3ccajq68qf9pejgnwomasm2dqchyaxpcaovnfkfgggxxpzj2jo...@mail.gmail.com%3E > > When you run Spark on Mesos it needs to run > > spark driver > mesos scheduler > >

Using Spark to process JSON with gzip filed

2015-12-16 Thread Eran Witkon
Hi, I have a few JSON files in which one of the fields is a binary field - this field is the output of running GZIP on a JSON stream and compressing it into the binary field. Now I want to decompress the field and get the output JSON. I was thinking of running a map operation and passing a function
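
(A sketch of such a map function, assuming the binary field arrives as a base64-encoded string of gzipped JSON; the field and helper names are hypothetical, and java.util.Base64 requires Java 8.)

```scala
import java.io.ByteArrayInputStream
import java.util.Base64
import java.util.zip.GZIPInputStream
import scala.io.Source

// Decode a base64 string holding gzipped JSON back into plain text.
def gunzipField(encoded: String): String = {
  val in = new GZIPInputStream(new ByteArrayInputStream(Base64.getDecoder.decode(encoded)))
  try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
}

// Applied inside a map over the parsed records, e.g.:
// val inner = records.map(rec => gunzipField(rec.payload))
```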

SparkEx PiAverage: Re: How to meet nested loop on pairRdd?

2015-12-16 Thread MegaLearn
I am wondering about the same concept as the OP; did anyone have an answer to this question? I can't see that Spark has loops built in, except to loop over a dataset of existing/known size. Thus I often create a "dummy" ArrayList and pass it to parallelize to control how many times Spark will run
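
(The pattern being described, sketched in Scala with a Monte Carlo pi estimate; the iteration count is illustrative.)

```scala
val iterations = 100000 // illustrative
// Parallelize a plain range just to drive N independent tasks.
val hits = sc.parallelize(1 to iterations).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}
val piEstimate = 4.0 * hits.sum() / iterations
```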

Error while running a job in yarn-client mode

2015-12-16 Thread sunil m
Hello Spark experts! I am using Spark 1.5.1 and get the following exception while running sample applications. Any tips/hints on how to solve the error below will be of great help! Exception in thread "main"

Re: looking for an easier way to count the number of items in a JavaDStream

2015-12-16 Thread Bryan Cutler
Hi Andy, regarding the foreachRDD return value, this JIRA, which will be in 1.6, should take care of that: https://issues.apache.org/jira/browse/SPARK-4557 and make things a little simpler. On Dec 15, 2015 6:55 PM, "Andy Davidson" wrote: > I am writing a JUnit test

Re: Compiling spark 1.5.1 fails with scala.reflect.internal.Types$TypeError: bad symbolic reference.

2015-12-16 Thread Simon Hafner
It happens with 2.11; you'll have to do both: ./dev/change-scala-version.sh 2.11 and then mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package. You get that error if you forget one, IIRC. 2015-12-05 20:17 GMT+08:00 MrAsanjar . : > Simon, I am getting the same error, how did

Re: Spark big rdd problem

2015-12-16 Thread Eran Witkon
I ran the yarn logs command and got the following: a set of YarnAllocator warnings 'expected to find requests, but found none.' Then an error: Akka ErrorMonitor: associationError ... But then I still get final app status: Succeeded, exit code 0. What do these errors mean? On Wed, 16 Dec 2015 at

Re: Hive error after update from 1.4.1 to 1.5.2

2015-12-16 Thread Ashwin Sai Shankar
Hi Bryan, I see the same issue with 1.5.2, can you pls let me know what was the resolution? Thanks, Ashwin On Fri, Nov 20, 2015 at 12:07 PM, Bryan Jeffrey wrote: > Nevermind. I had a library dependency that still had the old Spark version. > > On Fri, Nov 20, 2015 at

Preventing an RDD from shuffling

2015-12-16 Thread sparkuser2345
Is there a way to prevent an RDD from shuffling in a join operation without repartitioning it? I'm reading an RDD from sharded MongoDB, joining that with an RDD of incoming data (+ some additional calculations), and writing the resulting RDD back to MongoDB. It would make sense to shuffle only