Unable to use "Batch Start Time" on worker nodes.

2015-11-26 Thread Abhishek Anand
Hi ,

I need to use batch start time in my spark streaming job.

I need the value of the batch start time inside one of the functions that is
called within a flatMap function in Java.

Please suggest how this can be done.

I tried to use the StreamingListener class and set the value of a variable
inside the onBatchSubmitted function, something like this:

public void onBatchSubmitted(StreamingListenerBatchSubmitted batchSubmitted) {
    // record the batch time when the batch is submitted
    batchstarttime = batchSubmitted.batchInfo().batchTime().milliseconds();
    CommandLineArguments.BATCH_START_TIME = batchstarttime;
}


But the issue is that BATCH_START_TIME is set only when the batch starts.
I see in the worker logs that BATCH_START_TIME keeps its default value and
is never set.


Please suggest how this can be achieved.



BR,
Abhi


Grid search with Random Forest

2015-11-26 Thread Ndjido Ardo Bar

Hi folks,

Does anyone know whether the grid search capability has been enabled since issue
SPARK-9011 in version 1.4.0? I'm getting a "rawPredictionCol column doesn't exist"
error when trying to perform a grid search with Spark 1.4.0.

Cheers,
Ardo 







Re: Help with Couchbase connector error

2015-11-26 Thread Shixiong Zhu
Hey Eyal, I just checked the Couchbase Spark connector jar. The target
version of some of the classes is Java 8 (52.0). You can create a ticket at
https://issues.couchbase.com/projects/SPARKC

Best Regards,
Shixiong Zhu

2015-11-26 9:03 GMT-08:00 Ted Yu :

> StoreMode is from Couchbase connector.
>
> Where did you obtain the connector ?
>
> See also
> http://stackoverflow.com/questions/1096148/how-to-check-the-jdk-version-used-to-compile-a-class-file
>
> On Thu, Nov 26, 2015 at 8:55 AM, Eyal Sharon  wrote:
>
>> Hi ,
>> Great , that gave some directions. But can you elaborate more?  or share
>> some post
>> I am currently running JDK 7 , and  my Couchbase too
>>
>> Thanks !
>>
>> On Thu, Nov 26, 2015 at 6:02 PM, Ted Yu  wrote:
>>
>>> This implies version mismatch between the JDK used to build your jar and
>>> the one at runtime.
>>>
>>> When building, target JDK 1.7
>>>
>>> There're plenty of posts on the web for dealing with such error.
>>>
>>> Cheers
>>>
>>> On Thu, Nov 26, 2015 at 7:31 AM, Eyal Sharon  wrote:
>>>
 Hi,

 I am trying to set a connection to Couchbase. I am at the very
 beginning, and I got stuck on   this exception

 Exception in thread "main" java.lang.UnsupportedClassVersionError:
 com/couchbase/spark/StoreMode : Unsupported major.minor version 52.0


 Here is the simple code fragment

   val sc = new SparkContext(cfg)

   val doc1 = JsonDocument.create("doc1", JsonObject.create().put("some", 
 "content"))
   val doc2 = JsonArrayDocument.create("doc2", JsonArray.from("more", 
 "content", "in", "here"))


   val data = sc
 .parallelize(Seq(doc1, doc2))
 .saveToCouchbase()
 }


 Any help will be a bless


 Thanks!


 *This email and any files transmitted with it are confidential and
 intended solely for the use of the individual or entity to whom they are
 addressed. Please note that any disclosure, copying or distribution of the
 content of this information is strictly forbidden. If you have received
 this email message in error, please destroy it immediately and notify its
 sender.*

>>>
>>>
>>
>> *This email and any files transmitted with it are confidential and
>> intended solely for the use of the individual or entity to whom they are
>> addressed. Please note that any disclosure, copying or distribution of the
>> content of this information is strictly forbidden. If you have received
>> this email message in error, please destroy it immediately and notify its
>> sender.*
>>
>
>


Re: Stop Spark yarn-client job

2015-11-26 Thread Jeff Zhang
Could you attach the yarn AM log ?

On Fri, Nov 27, 2015 at 8:10 AM, Jagat Singh  wrote:

> Hi,
>
> What is the correct way to stop fully the Spark job which is running as
> yarn-client using spark-submit.
>
> We are using sc.stop in the code and can see the job still running (in
> yarn resource manager) after final hive insert is complete.
>
> The code flow is
>
> start context
> do somework
> insert to hive
> sc.stop
>
> This is sparkling water job is that matters.
>
> Is there anything else needed ?
>
> Thanks,
>
> J
>
>
>


-- 
Best Regards

Jeff Zhang


Stop Spark yarn-client job

2015-11-26 Thread Jagat Singh
Hi,

What is the correct way to fully stop a Spark job that is running as
yarn-client via spark-submit?

We are using sc.stop in the code and can see the job still running (in yarn
resource manager) after final hive insert is complete.

The code flow is

start context
do somework
insert to hive
sc.stop

This is a Sparkling Water job, if that matters.

Is there anything else needed?

Thanks,

J


Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-26 Thread Gylfi
HDFS has a default replication factor of 3, so 3.8 TB of data occupies roughly
3.8 x 3 ≈ 11.4 TB of raw storage, which matches the ~11.59 TB you are seeing.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471p25497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Millions of entities in custom Hadoop InputFormat and broadcast variable

2015-11-26 Thread Anfernee Xu
Hi Spark experts,

First of all, happy Thanksgiving!

Now to my question: I have implemented a custom Hadoop InputFormat to load
millions of entities from my data source into Spark (as a JavaRDD, then
transformed to a DataFrame). The approach I took in implementing the custom
Hadoop RDD is to load all IDs of my data entities (each entity has a unique
ID: Long) and split the ID list (containing 3 million Long numbers, for
example) into configured splits, each split containing a subset of IDs; in
turn my custom RecordReader loads the full entity (a plain Java Bean) from my
data source for each ID in the specific split.
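
For illustration, here is a rough sketch of that partitioning scheme expressed
directly with RDDs; loadIds, loadEntity, and the counts below are placeholders
for my real data-source calls, not the actual implementation:

import org.apache.spark.rdd.RDD

case class Entity(id: Long, payload: String)

// placeholders standing in for the real data-source access
def loadIds(): Seq[Long] = 1L to 3000000L
def loadEntity(id: Long): Entity = Entity(id, s"entity-$id")

val numSplits = 200
// each partition corresponds to one split holding a subset of IDs
val ids: RDD[Long] = sc.parallelize(loadIds(), numSplits)
// RecordReader-like step: fetch the full bean for every ID in the split
val entities: RDD[Entity] = ids.mapPartitions(_.map(loadEntity))
val df = sqlContext.createDataFrame(entities)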

My first observation is that some Spark tasks timed out, and it looks like a
Spark broadcast variable is being used to distribute my splits; is that
correct? If so, from a performance perspective, what enhancements can I make
to improve it?

Thanks

-- 
--Anfernee


Re: GraphX - How to make a directed graph an undirected graph?

2015-11-26 Thread Robineast
1. GraphX doesn't have a concept of undirected graphs; edges are always
specified with a srcId and a dstId. However, there is nothing to stop you
adding edges that point in the other direction, i.e. if you have an edge
srcId -> dstId you can also add the edge dstId -> srcId (see the sketch after
point 2).

2. In general APIs will return a single Graph object even if the resulting
graph is partitioned. You should read the API docs for the specifics though
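
A minimal sketch of point 1, assuming a spark-shell sc and made-up vertex and
edge data:

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0)))

// add the mirror image of every edge so both directions are traversable
val reversed = edges.map(e => Edge(e.dstId, e.srcId, e.attr))
val undirected = Graph(vertices, edges.union(reversed))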



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-How-to-make-a-directed-graph-an-undirected-graph-tp25495p25499.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




error while creating HiveContext

2015-11-26 Thread Chandra Mohan, Ananda Vel Murugan
Hi,

I am building a spark-sql application in Java. I created a Maven project in
Eclipse and added all dependencies, including spark-core and spark-sql. I am
creating a HiveContext in my Spark program and then trying to run SQL queries
against my Hive table. When I submit this job to Spark, for some reason it
tries to create a Derby metastore, even though my hive-site.xml clearly
specifies the JDBC URL of my MySQL database. So I think my hive-site.xml is
not getting picked up by the Spark program. I specified the hive-site.xml path
using the "--files" argument in spark-submit. I also tried placing the
hive-site.xml file in my jar. I even tried creating a Configuration object
with the hive-site.xml path and updating my HiveContext by calling the
addResource() method.

I want to know where I should put the Hive config files (in my jar, in my
Eclipse project, or on my cluster) for them to be picked up correctly by my
Spark program.

Thanks for any help.

Regards,
Anand.C



Optimizing large collect operations

2015-11-26 Thread Gylfi
Hi. 

I am doing very large collectAsMap() operations, about 10,000,000 records,
and I am getting
"org.apache.spark.SparkException: Error communicating with MapOutputTracker"
errors.

details: 
"org.apache.spark.SparkException: Error communicating with MapOutputTracker
at 
org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:117)
at
org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:164)
at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Error sending message [message =
GetMapOutputStatuses(1)]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
at 
org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:113)
... 12 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
... 13 more"

I have already set the akka timeout to 300 seconds, etc.
Does anyone have any ideas on what the problem could be?

Regards,
Gylfi.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-large-collect-operations-tp25498.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




controlling parquet file sizes for faster transfer to S3 from HDFS

2015-11-26 Thread AlexG
Is there a way to control how large the part- files are for a parquet
dataset? I'm currently using e.g.

results.toDF.coalesce(60).write.mode("append").parquet(outputdir)

to manually reduce the number of parts, but this doesn't map linearly to
fewer parts: I noticed that coalescing to 30 actually gives smaller parts.
I'd like to be able to specify the size of the part- files directly rather than
guess and check which coalesce value to use.

Why I care: my data is ~3Tb in Parquet form, with about 16 thousand files of
around 200MB each. Transferring this from HDFS on EC2 to S3 based on the
transfer rate I calculated from the yarn webui's progress indicator will
take more than 4 hours. By way of comparison, when I transferred 3.8 Tb of
data out from S3 to HDFS on EC2, that only took about 1.5 hours; there the
files were 1.7 Gb each. 

Minimizing the transfer time is important because I'll be taking the dataset
out of S3 many times.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/controlling-parquet-file-sizes-for-faster-transfer-to-S3-from-HDFS-tp25490.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




starting start-master.sh throws "java.lang.ClassNotFoundException: org.slf4j.Logger" error

2015-11-26 Thread Mich Talebzadeh
Hi,

 

I just built spark without hive jars and trying to run

 

start-master.sh

 

I get this error in the log. It looks like it cannot find org.slf4j.Logger
(java.lang.ClassNotFoundException):

 

Spark Command: /usr/java/latest/bin/java -cp
/usr/lib/spark/sbin/../conf/:/usr/lib/spark/lib/spark-assembly-1.5.2-hadoop2
.6.0.jar -Xms1g -Xmx1g -XX:MaxPermSize=256m
org.apache.spark.deploy.master.Master --ip rhes564 --port 7077 --webui-port
8080



Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger

at java.lang.Class.getDeclaredMethods0(Native Method)

at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)

at java.lang.Class.getMethod0(Class.java:2764)

at java.lang.Class.getMethod(Class.java:1653)

at
sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)

at
sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)

Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 6 more

 

Although I have added it to the CLASSPATH.

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 



Re: RE: Spark checkpoint problem

2015-11-26 Thread eric wong
I don't think it is a deliberate design.

So you may need to run an action on the upstream RDD before the action on the
downstream RDD, if you want the upstream RDD to be explicitly checkpointed.
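
Something like this, a minimal spark-shell sketch based on the code in your
mail (whether it fits your real job is an assumption on my side):

sc.setCheckpointDir("/tmp/checkpoint")

val data = sc.parallelize(List("a", "b", "c"))
data.checkpoint()
data.count()   // action with `data` as the final RDD, so its checkpoint is materialized

val temp = data.map(item => (item, 1))
temp.checkpoint()
temp.count()   // `temp` is checkpointed here, as you already observed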


2015-11-26 13:23 GMT+08:00 wyphao.2007 :

> Spark 1.5.2.
>
> On 2015-11-26 13:19:39, "张志强(旺轩)"  wrote:
>
> What's your spark version?
>
> *From:* wyphao.2007 [mailto:wyphao.2...@163.com]
> *Sent:* 2015-11-26 10:04
> *To:* user
> *Cc:* d...@spark.apache.org
> *Subject:* Spark checkpoint problem
>
> I am test checkpoint to understand how it works, My code as following:
>
>
>
> scala> val data = sc.parallelize(List("a", "b", "c"))
>
> data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
> parallelize at :15
>
>
>
> scala> sc.setCheckpointDir("/tmp/checkpoint")
>
> 15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be
> non-local if Spark is running on a cluster: /tmp/checkpoint1
>
>
>
> scala> data.checkpoint
>
>
>
> scala> val temp = data.map(item => (item, 1))
>
> temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map
> at :17
>
>
>
> scala> temp.checkpoint
>
>
>
> scala> temp.count
>
>
>
> but I found that only the temp RDD is checkpointed in the /tmp/checkpoint
> directory; the data RDD is not checkpointed! I found the doCheckpoint
> function in the org.apache.spark.rdd.RDD class:
>
>
>
>   private[spark] def doCheckpoint(): Unit = {
>     RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
>       if (!doCheckpointCalled) {
>         doCheckpointCalled = true
>         if (checkpointData.isDefined) {
>           checkpointData.get.checkpoint()
>         } else {
>           dependencies.foreach(_.rdd.doCheckpoint())
>         }
>       }
>     }
>   }
>
>
>
> From the code above, only the last RDD (in my case, temp) will be
> checkpointed. My question: is this deliberately designed, or is it a bug?
>
>
>
> Thank you.
>



-- 
王海华


ClassNotFoundException with a uber jar.

2015-11-26 Thread Marc de Palol
Hi all, 

I have an uber jar made with Maven; the contents are:

my.org.my.classes.Class
...
lib/lib1.jar // 3rd party libs
lib/lib2.jar 

I'm using this kind of jar for hadoop applications and all works fine. 

I added spark libs, scala and everything needed in spark, but when I submit
this jar to spark I get ClassNotFoundExceptions: 

spark-submit --class com.bla.TestJob --driver-memory 512m --master
yarn-client /home/ble/uberjar.jar

Then when the job is running I get this: 
java.lang.NoClassDefFoundError:
com/fasterxml/jackson/datatype/guava/GuavaModule
// usage of jackson's GuavaModule is expected, as the job is using jackson
to read json.


this class is contained in: 
lib/jackson-datatype-guava-2.4.3.jar, which is in the uberjar

So I really don't know what I'm missing. I've tried using --jars and
SparkContext.addJar (adding the uber jar) with no luck.

Is there any problem with using uber jars that have inner jars inside?

Thanks!






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-with-a-uber-jar-tp25493.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: ClassNotFoundException with a uber jar.

2015-11-26 Thread Ali Tajeldin EDU
I'm not 100% sure, but I don't think a jar within a jar will work without a
custom class loader. You can perhaps try to use "maven-assembly-plugin" or
"maven-shade-plugin" to build your uber/fat jar. Both of these will build a
flattened single jar.
--
Ali

On Nov 26, 2015, at 2:49 AM, Marc de Palol  wrote:

> Hi all, 
> 
> I have a uber jar made with maven, the contents are:
> 
> my.org.my.classes.Class
> ...
> lib/lib1.jar // 3rd party libs
> lib/lib2.jar 
> 
> I'm using this kind of jar for hadoop applications and all works fine. 
> 
> I added spark libs, scala and everything needed in spark, but when I submit
> this jar to spark I get ClassNotFoundExceptions: 
> 
> spark-submit --class com.bla.TestJob --driver-memory 512m --master
> yarn-client /home/ble/uberjar.jar
> 
> Then when the job is running I get this: 
> java.lang.NoClassDefFoundError:
> com/fasterxml/jackson/datatype/guava/GuavaModule
> // usage of jackson's GuavaModule is expected, as the job is using jackson
> to read json.
> 
> 
> this class is contained in: 
> lib/jackson-datatype-guava-2.4.3.jar, which is in the uberjar
> 
> So I really don't know what I'm missing. I've tried to use --jars and
> SparkContext.addJar (adding the uberjar) with no luck. 
> 
> Is there any problem using uberjars with inner jars inside ? 
> 
> Thanks!
> 
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-with-a-uber-jar-tp25493.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 





java.io.FileNotFoundException: Job aborted due to stage failure

2015-11-26 Thread Sahil Sareen
I'm using Spark 1.4.2 with Hadoop 2.7. I tried increasing
spark.shuffle.io.maxRetries to 10, but it didn't help.

Any ideas on what could be causing this??

This is the exception that I am getting:

[MySparkApplication] WARN : Failed to execute SQL statement select *
from TableS s join TableC c on s.property = c.property from X YZ
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 4 in stage 5710.0 failed 4 times, most recent failure: Lost task
4.3 in stage 5710.0 (TID 341269,
ip-10-0-1-80.us-west-2.compute.internal):
java.io.FileNotFoundException:
/mnt/md0/var/lib/spark/spark-549f7d96-82da-4b8d-b9fe-7f6fe8238478/blockmgr-
f44be41a-9036-4b93-8608-4a8b2fabbc06/0b/shuffle_3257_4_0.data
(Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.open(
BlockObjectWriter.scala:128)
at org.apache.spark.storage.DiskBlockObjectWriter.write(
BlockObjectWriter.scala:203)
at org.apache.spark.util.collection.WritablePartitionedIterator$$
anon$3.writeNext(WritablePartitionedPairCollection.scala:104)
at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(
ExternalSorter.scala:757)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(
SortShuffleWriter.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(
ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(
ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$
scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1276)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
DAGScheduler.scala:1267)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
DAGScheduler.scala:1266)
at scala.collection.mutable.ResizableArray$class.foreach(
ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(
DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$
handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$
handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(
DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
onReceive(DAGScheduler.scala:1460)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
onReceive(DAGScheduler.scala:1421)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Thanks
Sahil


[no subject]

2015-11-26 Thread Dmitry Tolpeko



custom inputformat recordreader

2015-11-26 Thread Patcharee Thongtra

Hi,

In python how to use inputformat/custom recordreader?

Thanks,
Patcharee





MySQLSyntaxErrorException when connect hive to sparksql

2015-11-26 Thread luohui20001
hi guys, when I am trying to connect hive with spark-sql, I got a problem like
below:

[root@master spark]# bin/spark-shell --master local[4]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
15/11/26 21:14:35 WARN Utils: Your hostname, master resolves to a loopback address: 127.0.1.1; using 10.60.162.236 instead (on interface eth1)
15/11/26 21:14:35 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/11/26 21:14:35 WARN SparkConf:
SPARK_CLASSPATH was detected (set to ':/usr/lib/spark/lib/mysql-connector-java-5.1.21-bin.jar').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath

15/11/26 21:14:35 WARN SparkConf: Setting 'spark.executor.extraClassPath' to ':/usr/lib/spark/lib/mysql-connector-java-5.1.21-bin.jar' as a work-around.
15/11/26 21:14:35 WARN SparkConf: Setting 'spark.driver.extraClassPath' to ':/usr/lib/spark/lib/mysql-connector-java-5.1.21-bin.jar' as a work-around.
15/11/26 21:14:36 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/11/26 21:14:38 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/26 21:14:39 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/26 21:14:44 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/11/26 21:14:44 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/11/26 21:14:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/26 21:14:46 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/26 21:14:46 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/26 21:14:48 ERROR Datastore: Error thrown executing CREATE TABLE `PARTITION_PARAMS`
(
    `PART_ID` BIGINT NOT NULL,
    `PARAM_KEY` VARCHAR(256) BINARY NOT NULL,
    `PARAM_VALUE` VARCHAR(4000) BINARY NULL,
    CONSTRAINT `PARTITION_PARAMS_PK` PRIMARY KEY (`PART_ID`,`PARAM_KEY`)
) ENGINE=INNODB : Specified key was too long; max key length is 767 bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes

You may access the full log in the attached compressed file. The
conf/hive-site.xml is also attached here.
I tried modifying the MySQL character set to both latin1 and utf8, but neither
works. This exception can even occur when loading data into my table.
Any ideas will be appreciated.



 

ThanksBest regards!
San.Luo


MysqlException.rar
Description: application/rar-compressed


Re: custom inputformat recordreader

2015-11-26 Thread Ted Yu
Please take a look at:
python/pyspark/tests.py

There're examples using sc.hadoopFile() and sc.newAPIHadoopRDD()

Cheers

On Thu, Nov 26, 2015 at 4:50 AM, Patcharee Thongtra <
patcharee.thong...@uni.no> wrote:

> Hi,
>
> In python how to use inputformat/custom recordreader?
>
> Thanks,
> Patcharee
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: java.io.FileNotFoundException: Job aborted due to stage failure

2015-11-26 Thread Ted Yu
bq. (Permission denied)

Have you checked the permission for /mnt/md0/var/lib/spark/... ?

Cheers

On Thu, Nov 26, 2015 at 3:03 AM, Sahil Sareen  wrote:

> Im using Spark1.4.2 with Hadoop 2.7, I tried increasing
> spark.shuffle.io.maxRetries to 10  but didn't help.
>
> Any ideas on what could be causing this??
>
> This is the exception that I am getting:
>
> [MySparkApplication] WARN : Failed to execute SQL statement select *
> from TableS s join TableC c on s.property = c.property from X YZ
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 4 in stage 5710.0 failed 4 times, most recent failure: Lost task
> 4.3 in stage 5710.0 (TID 341269,
> ip-10-0-1-80.us-west-2.compute.internal):
> java.io.FileNotFoundException:
> /mnt/md0/var/lib/spark/spark-549f7d96-82da-4b8d-b9fe-
> 7f6fe8238478/blockmgr-f44be41a-9036-4b93-8608-
> 4a8b2fabbc06/0b/shuffle_3257_4_0.data
> (Permission denied)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.(FileOutputStream.java:213)
> at org.apache.spark.storage.DiskBlockObjectWriter.open(
> BlockObjectWriter.scala:128)
> at org.apache.spark.storage.DiskBlockObjectWriter.write(
> BlockObjectWriter.scala:203)
> at org.apache.spark.util.collection.WritablePartitionedIterator$$
> anon$3.writeNext(WritablePartitionedPairCollection.scala:104)
> at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(
> ExternalSorter.scala:757)
> at org.apache.spark.shuffle.sort.SortShuffleWriter.write(
> SortShuffleWriter.scala:70)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:70)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$
> scheduler$DAGScheduler$$failJobAndIndependentStages(
> DAGScheduler.scala:1276)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
> DAGScheduler.scala:1267)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
> DAGScheduler.scala:1266)
> at scala.collection.mutable.ResizableArray$class.foreach(
> ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(
> DAGScheduler.scala:1266)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$
> handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$
> handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(
> DAGScheduler.scala:730)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
> onReceive(DAGScheduler.scala:1460)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
> onReceive(DAGScheduler.scala:1421)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> Thanks
> Sahil
>


Help with Couchbase connector error

2015-11-26 Thread Eyal Sharon
Hi,

I am trying to set a connection to Couchbase. I am at the very beginning,
and I got stuck on   this exception

Exception in thread "main" java.lang.UnsupportedClassVersionError:
com/couchbase/spark/StoreMode : Unsupported major.minor version 52.0


Here is the simple code fragment

  val sc = new SparkContext(cfg)

  val doc1 = JsonDocument.create("doc1",
JsonObject.create().put("some", "content"))
  val doc2 = JsonArrayDocument.create("doc2", JsonArray.from("more",
"content", "in", "here"))


  val data = sc
.parallelize(Seq(doc1, doc2))
.saveToCouchbase()
}


Any help will be a bless


Thanks!

-- 


*This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are 
addressed. Please note that any disclosure, copying or distribution of the 
content of this information is strictly forbidden. If you have received 
this email message in error, please destroy it immediately and notify its 
sender.*


Re: Help with Couchbase connector error

2015-11-26 Thread Ted Yu
This implies version mismatch between the JDK used to build your jar and
the one at runtime.

When building, target JDK 1.7
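
For example, if your jar is built with sbt (just an assumption on my side;
adjust accordingly for a Maven build), the build.sbt fragment might look like:

// compile both Java and Scala sources for JVM 1.7
javacOptions ++= Seq("-source", "1.7", "-target", "1.7")
scalacOptions += "-target:jvm-1.7"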

There're plenty of posts on the web for dealing with such error.

Cheers

On Thu, Nov 26, 2015 at 7:31 AM, Eyal Sharon  wrote:

> Hi,
>
> I am trying to set a connection to Couchbase. I am at the very beginning,
> and I got stuck on   this exception
>
> Exception in thread "main" java.lang.UnsupportedClassVersionError:
> com/couchbase/spark/StoreMode : Unsupported major.minor version 52.0
>
>
> Here is the simple code fragment
>
>   val sc = new SparkContext(cfg)
>
>   val doc1 = JsonDocument.create("doc1", JsonObject.create().put("some", 
> "content"))
>   val doc2 = JsonArrayDocument.create("doc2", JsonArray.from("more", 
> "content", "in", "here"))
>
>
>   val data = sc
> .parallelize(Seq(doc1, doc2))
> .saveToCouchbase()
> }
>
>
> Any help will be a bless
>
>
> Thanks!
>
>
> *This email and any files transmitted with it are confidential and
> intended solely for the use of the individual or entity to whom they are
> addressed. Please note that any disclosure, copying or distribution of the
> content of this information is strictly forbidden. If you have received
> this email message in error, please destroy it immediately and notify its
> sender.*
>


Re: MySQLSyntaxErrorException when connect hive to sparksql

2015-11-26 Thread Ted Yu
Have you seen this thread ?

http://search-hadoop.com/m/q3RTtCoKmv14Hd1H1=Re+Spark+Hive+max+key+length+is+767+bytes

On Thu, Nov 26, 2015 at 5:26 AM,  wrote:

> hi guys,
>
>  when I am trying to connect hive with spark-sql,I got a problem like
> below:
>
>
> [root@master spark]# bin/spark-shell --master local[4]
>
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>
> log4j:WARN Please initialize the log4j system properly.
>
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
>
> Using Spark's repl log4j profile:
> org/apache/spark/log4j-defaults-repl.properties
>
> To adjust logging level use sc.setLogLevel("INFO")
>
> Welcome to
>
>     __
>
>  / __/__  ___ _/ /__
>
> _\ \/ _ \/ _ `/ __/  '_/
>
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>
>   /_/
>
>
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.8.0_65)
>
> Type in expressions to have them evaluated.
>
> Type :help for more information.
>
> 15/11/26 21:14:35 WARN Utils: Your hostname, master resolves to a loopback
> address: 127.0.1.1; using 10.60.162.236 instead (on interface eth1)
>
> 15/11/26 21:14:35 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
> another address
>
> 15/11/26 21:14:35 WARN SparkConf:
>
> SPARK_CLASSPATH was detected (set to
> ':/usr/lib/spark/lib/mysql-connector-java-5.1.21-bin.jar').
>
> This is deprecated in Spark 1.0+.
>
>
> Please instead use:
>
>  - ./spark-submit with --driver-class-path to augment the driver classpath
>
>  - spark.executor.extraClassPath to augment the executor classpath
>
>
> 15/11/26 21:14:35 WARN SparkConf: Setting 'spark.executor.extraClassPath'
> to ':/usr/lib/spark/lib/mysql-connector-java-5.1.21-bin.jar' as a
> work-around.
>
> 15/11/26 21:14:35 WARN SparkConf: Setting 'spark.driver.extraClassPath' to
> ':/usr/lib/spark/lib/mysql-connector-java-5.1.21-bin.jar' as a work-around.
>
> 15/11/26 21:14:36 WARN MetricsSystem: Using default name DAGScheduler for
> source because spark.app.id is not set.
>
> Spark context available as sc.
>
> 15/11/26 21:14:38 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
>
> 15/11/26 21:14:39 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
>
> 15/11/26 21:14:44 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
>
> 15/11/26 21:14:44 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
>
> 15/11/26 21:14:46 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 15/11/26 21:14:46 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
>
> 15/11/26 21:14:46 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
>
> 15/11/26 21:14:48 ERROR Datastore: Error thrown executing CREATE TABLE
> `PARTITION_PARAMS`
>
> (
>
> `PART_ID` BIGINT NOT NULL,
>
> `PARAM_KEY` VARCHAR(256) BINARY NOT NULL,
>
> `PARAM_VALUE` VARCHAR(4000) BINARY NULL,
>
> CONSTRAINT `PARTITION_PARAMS_PK` PRIMARY KEY (`PART_ID`,`PARAM_KEY`)
>
> ) ENGINE=INNODB : Specified key was too long; max key length is 767 bytes
>
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key
> was too long; max key length is 767 bytes
>
>
>
> you may access full log file  in the compressed file. And the
> conf/hive-site.xml is also attached here.
>
>
> I tried to modify mysql charcter set to latin1 and utf8,both doesn't work.
> Even this exception may occur when loading data to my table.
>
>
> Any idea will be appreciated.
>
>
>
> 
>
> ThanksBest regards!
> San.Luo
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Help with Couchbase connector error

2015-11-26 Thread Eyal Sharon
Hi,
Great, that gave some direction. But can you elaborate more, or share a post?
I am currently running JDK 7, and so is my Couchbase.

Thanks !

On Thu, Nov 26, 2015 at 6:02 PM, Ted Yu  wrote:

> This implies version mismatch between the JDK used to build your jar and
> the one at runtime.
>
> When building, target JDK 1.7
>
> There're plenty of posts on the web for dealing with such error.
>
> Cheers
>
> On Thu, Nov 26, 2015 at 7:31 AM, Eyal Sharon  wrote:
>
>> Hi,
>>
>> I am trying to set a connection to Couchbase. I am at the very beginning,
>> and I got stuck on   this exception
>>
>> Exception in thread "main" java.lang.UnsupportedClassVersionError:
>> com/couchbase/spark/StoreMode : Unsupported major.minor version 52.0
>>
>>
>> Here is the simple code fragment
>>
>>   val sc = new SparkContext(cfg)
>>
>>   val doc1 = JsonDocument.create("doc1", JsonObject.create().put("some", 
>> "content"))
>>   val doc2 = JsonArrayDocument.create("doc2", JsonArray.from("more", 
>> "content", "in", "here"))
>>
>>
>>   val data = sc
>> .parallelize(Seq(doc1, doc2))
>> .saveToCouchbase()
>> }
>>
>>
>> Any help will be a bless
>>
>>
>> Thanks!
>>
>>
>> *This email and any files transmitted with it are confidential and
>> intended solely for the use of the individual or entity to whom they are
>> addressed. Please note that any disclosure, copying or distribution of the
>> content of this information is strictly forbidden. If you have received
>> this email message in error, please destroy it immediately and notify its
>> sender.*
>>
>
>

-- 


*This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are 
addressed. Please note that any disclosure, copying or distribution of the 
content of this information is strictly forbidden. If you have received 
this email message in error, please destroy it immediately and notify its 
sender.*


Re: Help with Couchbase connector error

2015-11-26 Thread Ted Yu
StoreMode is from Couchbase connector.

Where did you obtain the connector ?

See also
http://stackoverflow.com/questions/1096148/how-to-check-the-jdk-version-used-to-compile-a-class-file

On Thu, Nov 26, 2015 at 8:55 AM, Eyal Sharon  wrote:

> Hi ,
> Great , that gave some directions. But can you elaborate more?  or share
> some post
> I am currently running JDK 7 , and  my Couchbase too
>
> Thanks !
>
> On Thu, Nov 26, 2015 at 6:02 PM, Ted Yu  wrote:
>
>> This implies version mismatch between the JDK used to build your jar and
>> the one at runtime.
>>
>> When building, target JDK 1.7
>>
>> There're plenty of posts on the web for dealing with such error.
>>
>> Cheers
>>
>> On Thu, Nov 26, 2015 at 7:31 AM, Eyal Sharon  wrote:
>>
>>> Hi,
>>>
>>> I am trying to set a connection to Couchbase. I am at the very
>>> beginning, and I got stuck on   this exception
>>>
>>> Exception in thread "main" java.lang.UnsupportedClassVersionError:
>>> com/couchbase/spark/StoreMode : Unsupported major.minor version 52.0
>>>
>>>
>>> Here is the simple code fragment
>>>
>>>   val sc = new SparkContext(cfg)
>>>
>>>   val doc1 = JsonDocument.create("doc1", JsonObject.create().put("some", 
>>> "content"))
>>>   val doc2 = JsonArrayDocument.create("doc2", JsonArray.from("more", 
>>> "content", "in", "here"))
>>>
>>>
>>>   val data = sc
>>> .parallelize(Seq(doc1, doc2))
>>> .saveToCouchbase()
>>> }
>>>
>>>
>>> Any help will be a bless
>>>
>>>
>>> Thanks!
>>>
>>>
>>> *This email and any files transmitted with it are confidential and
>>> intended solely for the use of the individual or entity to whom they are
>>> addressed. Please note that any disclosure, copying or distribution of the
>>> content of this information is strictly forbidden. If you have received
>>> this email message in error, please destroy it immediately and notify its
>>> sender.*
>>>
>>
>>
>
> *This email and any files transmitted with it are confidential and
> intended solely for the use of the individual or entity to whom they are
> addressed. Please note that any disclosure, copying or distribution of the
> content of this information is strictly forbidden. If you have received
> this email message in error, please destroy it immediately and notify its
> sender.*
>


question about combining small parquet files

2015-11-26 Thread Nezih Yigitbasi
Hi Spark people,
I have a Hive table that has a lot of small parquet files and I am
creating a data frame out of it to do some processing, but since I have a
large number of splits/files my job creates a lot of tasks, which I don't
want. Basically what I want is the same functionality that Hive provides,
that is, to combine these small input splits into larger ones by specifying
a max split size setting. Is this currently possible with Spark?

I look at coalesce() but with coalesce I can only control the number
of output files not their sizes. And since the total input dataset size
can vary significantly in my case, I cannot just use a fixed partition
count as the size of each output file can get very large. I then looked for
getting the total input size from an rdd to come up with some heuristic to
set the partition count, but I couldn't find any ways to do it (without
modifying the spark source).
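
For concreteness, this is the kind of heuristic I have in mind, sketched here
with the Hadoop FileSystem API rather than an RDD; the path and target size
below are made up:

import org.apache.hadoop.fs.{FileSystem, Path}

val inputDir = "/user/hive/warehouse/my_table"   // placeholder table location
val targetPartitionBytes = 256L * 1024 * 1024    // aim for ~256 MB per partition

val fs = FileSystem.get(sc.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(inputDir)).getLength
val numPartitions = math.max(1, (totalBytes / targetPartitionBytes).toInt)

val df = sqlContext.read.parquet(inputDir).coalesce(numPartitions)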

Any help is appreciated.

Thanks,
Nezih

PS: this email is the same as my previous email as I learned that my
previous email ended up as spam for many people since I sent it through
nabble, sorry for the double post.


Re: question about combining small parquet files

2015-11-26 Thread Ruslan Dautkhanov
An interesting compaction approach of small files is discussed recently
http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/


AFAIK Spark supports views too.


-- 
Ruslan Dautkhanov

On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi <
nyigitb...@netflix.com.invalid> wrote:

> Hi Spark people,
> I have a Hive table that has a lot of small parquet files and I am
> creating a data frame out of it to do some processing, but since I have a
> large number of splits/files my job creates a lot of tasks, which I don't
> want. Basically what I want is the same functionality that Hive provides,
> that is, to combine these small input splits into larger ones by specifying
> a max split size setting. Is this currently possible with Spark?
>
> I look at coalesce() but with coalesce I can only control the number
> of output files not their sizes. And since the total input dataset size
> can vary significantly in my case, I cannot just use a fixed partition
> count as the size of each output file can get very large. I then looked for
> getting the total input size from an rdd to come up with some heuristic to
> set the partition count, but I couldn't find any ways to do it (without
> modifying the spark source).
>
> Any help is appreciated.
>
> Thanks,
> Nezih
>
> PS: this email is the same as my previous email as I learned that my
> previous email ended up as spam for many people since I sent it through
> nabble, sorry for the double post.
>


possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-26 Thread Andy Davidson
I am using spark-1.5.1-bin-hadoop2.6. I used
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured
spark-env to use python3. I get an exception 'Randomness of hash of string
should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should
not just set PYTHONHASHSEED?

Should I file a bug?

Kind regards

Andy

details

http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtra
ct#pyspark.RDD.subtract

The example does not work out of the box:

subtract(other, numPartitions=None)


Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]
It raises 

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")


The following script fixes the problem

sudo printf "\n# set PYTHONHASHSEED so python3 will not generate the exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh

sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`

for i in `cat slaves`; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done







Re: UDF with 2 arguments

2015-11-26 Thread Daniel Lopes
Thanks Davies and Nathan,

I found my error.

I was using *ArrayType()* without passing the kind of type the array holds; I
should have been passing *ArrayType(IntegerType())*.

Thanks :)

On Wed, Nov 25, 2015 at 7:46 PM, Davies Liu  wrote:

> It works in master (1.6), what's the version of Spark you have?
>
> >>> from pyspark.sql.functions import udf
> >>> def f(a, b): pass
> ...
> >>> my_udf = udf(f)
> >>> from pyspark.sql.types import *
> >>> my_udf = udf(f, IntegerType())
>
>
> On Wed, Nov 25, 2015 at 12:01 PM, Daniel Lopes 
> wrote:
> > Hallo,
> >
> > supose I have function in pyspark that
> >
> > def function(arg1,arg2):
> >   pass
> >
> > and
> >
> > udf_function = udf(function, IntegerType())
> >
> > that takes me error
> >
> > Traceback (most recent call last):
> >   File "", line 1, in 
> > TypeError: __init__() takes at least 2 arguments (1 given)
> >
> >
> > How I use?
> >
> > Best,
> >
> >
> > --
> > Daniel Lopes, B.Eng
> > Data Scientist - BankFacil
> > CREA/SP 5069410560
> > Mob +55 (18) 99764-2733
> > Ph +55 (11) 3522-8009
> > http://about.me/dannyeuu
> >
> > Av. Nova Independência, 956, São Paulo, SP
> > Bairro Brooklin Paulista
> > CEP 04570-001
> > https://www.bankfacil.com.br
> >
>



-- 
*Daniel Lopes, B.Eng*
Data Scientist - BankFacil
CREA/SP 5069410560

Mob +55 (18) 99764-2733 
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br


Re: error while creating HiveContext

2015-11-26 Thread fightf...@163.com
Hi, 
I think you just need to put the hive-site.xml in the spark/conf directory; it
will then be loaded onto the Spark classpath.

Best,
Sun.



fightf...@163.com
 
From: Chandra Mohan, Ananda Vel Murugan
Date: 2015-11-27 15:04
To: user
Subject: error while creating HiveContext
Hi, 
 
I am building a spark-sql application in Java. I created a maven project in 
Eclipse and added all dependencies including spark-core and spark-sql. I am 
creating HiveContext in my spark program and then try to run sql queries 
against my Hive Table. When I submit this job in spark, for some reasons it is 
trying to create derby metastore. But my hive-site.xml clearly specifies the 
jdbc url of my MySQL . So I think my hive-site.xml is not getting picked by 
spark program. I specified hive-site.xml path using “—files” argument in 
spark-submit. I also tried placing hive-site.xml file in my jar . I even tried 
creating Configuration object with hive-site.xml path and updated my 
HiveContext by calling addResource() method.   
 
I want to know where I should put hive config files in my jar or in my eclipse 
project or in my cluster for it to be picked by correctly in my spark program. 
 
Thanks for any help. 
 
Regards,
Anand.C
 


Spark on yarn vs spark standalone

2015-11-26 Thread cs user
Hi All,

Apologies if this question has been asked before. I'd like to know if there
are any downsides to running spark over yarn with the --master yarn-cluster
option vs having a separate spark standalone cluster to execute jobs?

We're looking at installing a hdfs/hadoop cluster with Ambari and
submitting jobs to the cluster using yarn, or having an Ambari cluster and
a separate standalone spark cluster, which will run the spark jobs on data
within hdfs.

With yarn, will we still get all the benefits of spark?

Will it be possible to process streaming data?

Many thanks in advance for any responses.

Cheers!


Re: Spark on yarn vs spark standalone

2015-11-26 Thread Jeff Zhang
If your cluster is a dedicated spark cluster (only running spark job, no
other jobs like hive/pig/mr), then spark standalone would be fine.
Otherwise I think yarn would be a better option.

On Fri, Nov 27, 2015 at 3:36 PM, cs user  wrote:

> Hi All,
>
> Apologies if this question has been asked before. I'd like to know if
> there are any downsides to running spark over yarn with the --master
> yarn-cluster option vs having a separate spark standalone cluster to
> execute jobs?
>
> We're looking at installing a hdfs/hadoop cluster with Ambari and
> submitting jobs to the cluster using yarn, or having an Ambari cluster and
> a separate standalone spark cluster, which will run the spark jobs on data
> within hdfs.
>
> With yarn, will we still get all the benefits of spark?
>
> Will it be possible to process streaming data?
>
> Many thanks in advance for any responses.
>
> Cheers!
>



-- 
Best Regards

Jeff Zhang


Re: Optimizing large collect operations

2015-11-26 Thread Jeff Zhang
For such large output, I would suggest doing the subsequent processing in the
cluster rather than in the driver (use the RDD API for that).
If you really want to pull the result to the driver, you can first save it to
HDFS and then read it back using the HDFS API, to avoid the Akka issue.
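
A rough sketch of that second option (the pair RDD, record format, and path
below are made up for illustration):

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// 1) do the heavy work in the cluster and write the result to HDFS
val pairs = sc.parallelize(1 to 1000000).map(i => (s"key-$i", i))
pairs.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("hdfs:///tmp/large-result")

// 2) if the driver really needs the data, stream it back with the HDFS API
//    instead of collectAsMap(), so it never goes through the RPC layer
val fs = FileSystem.get(sc.hadoopConfiguration)
val partFiles = fs.listStatus(new Path("hdfs:///tmp/large-result"))
  .map(_.getPath)
  .filter(_.getName.startsWith("part-"))

val result = scala.collection.mutable.Map[String, Int]()
for (p <- partFiles; line <- Source.fromInputStream(fs.open(p)).getLines()) {
  val Array(k, v) = line.split("\t")
  result(k) = v.toInt
}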

On Fri, Nov 27, 2015 at 2:41 PM, Gylfi  wrote:

> Hi.
>
> I am doing very large collectAsMap() operations, about 10,000,000 records,
> and I am getting
> "org.apache.spark.SparkException: Error communicating with
> MapOutputTracker"
> errors..
>
> details:
> "org.apache.spark.SparkException: Error communicating with MapOutputTracker
> at
> org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:117)
> at
>
> org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:164)
> at
>
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
> at
>
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Error sending message [message
> =
> GetMapOutputStatuses(1)]
> at
> org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
> at
> org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:113)
> ... 12 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [300 seconds]
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at
>
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at
> org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
> ... 13 more"
>
> I have already set set the akka.timeout to 300 etc.
> Anyone have any ideas on what the problem could be ?
>
> Regares,
> Gylfi.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-large-collect-operations-tp25498.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang