Spark runs only on Mesos v0.21?

2016-02-12 Thread Petr Novak
Hi all,
based on the documentation:

"Spark 1.6.0 is designed for use with Mesos 0.21.0 and does not require any
special patches of Mesos."

We are considering Mesos for our use, but this concerns me a lot. Mesos is
currently on v0.27, which we need for its Volumes feature, but Spark locks
us to 0.21 only. I understand the problem is that Mesos is not 1.0
yet and makes breaking changes to its API. But when leading frameworks
don't catch up fast, it defeats the whole purpose of Mesos - running on one
unified platform and sharing resources; a single framework can block a Mesos
upgrade.

We don't want to develop our own services against the Mesos 0.21 API.

Is there a reason why Spark is so far behind? Does Spark actually care
about Mesos and its support, or has the focus moved to YARN?

Many thanks,
Petr


Re: best practices? spark streaming writing output detecting disk full error

2016-02-12 Thread Arkadiusz Bicz
Hi,

You need good monitoring tools to send you alarms about disk, network
or application errors, but I think this is general DevOps work, not
very specific to Spark or Hadoop.

BR,

Arkadiusz Bicz
https://www.linkedin.com/in/arkadiuszbicz

On Thu, Feb 11, 2016 at 7:09 PM, Andy Davidson
 wrote:
> We recently started a Spark/Spark Streaming POC. We wrote a simple streaming
> app in Java to collect tweets. We chose Twitter because we knew we would get a
> lot of data and probably lots of bursts. Good for stress testing.
>
> We spun up a couple of small clusters using the spark-ec2 script. In one
> cluster we wrote all the tweets to HDFS; in a second cluster we wrote all the
> tweets to S3.
>
> We were surprised that our HDFS file system reached 100% of capacity in a
> few days. This resulted in “all data nodes dead”. We were surprised
> because the stream app actually continued to run. We had no idea we had a
> problem until a day or two after the disk became full, when we noticed we
> were missing a lot of data.
>
> We ran into a similar problem with our S3 cluster. We had a permission
> problem and were unable to write any data, yet our stream app continued to
> run.
>
>
> Spark generated mountains of logs. We are using the standalone cluster
> manager. All the log levels wind up in the “error” log, making it hard to
> find real errors and warnings using the web UI. Our app is written in Java,
> so my guess is the write exceptions must be unchecked, i.e. we did not know in
> advance that they could occur. They are basically undocumented.
>
>
>
> We are a small shop. Running something like splunk would add a lot of
> expense and complexity for us at this stage of our growth.
>
> What are best practices
>
> Kind Regards
>
> Andy




Re: Spark runs only on Mesos v0.21?

2016-02-12 Thread Tamas Szuromi
Hello Petr,

We're running Spark 1.5.2 and 1.6.0 on Mesos 0.25.0 without any problem. We
upgraded from 0.21.0 originally.
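For readers in the same situation, a minimal sketch of pointing Spark at a newer Mesos
master (the host, port and tarball path below are placeholders, not values from this
thread); the driver machine needs libmesos available and the agents need a reachable
Spark distribution:

export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
spark-shell --master mesos://mesos-master:5050 \
  --conf spark.executor.uri=hdfs:///spark/spark-1.6.0-bin-hadoop2.6.tgz \
  --conf spark.mesos.coarse=true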

cheers,
Tamas




On 12 February 2016 at 09:31, Petr Novak  wrote:

> Hi all,
> based on documenation:
>
> "Spark 1.6.0 is designed for use with Mesos 0.21.0 and does not require
> any special patches of Mesos."
>
> We are considering Mesos for our use but this concerns me a lot. Mesos is
> currently on v0.27 which we need for its Volumes feature. But Spark locks
> us to 0.21 only. I understand that it is the problem that Mesos is not 1.0
> yet and make breaking changes to its API. But when leading frameworks
> doesn't catch up fast it beats the whole purpose of Mesos - run on one
> unified platform and share resources, single framework can lock down Mesos
> upgrade.
>
> We don't want to develop our own services against Spark 0.21 API.
>
> Is there a reason why Spark is so much behind? Does Spark actually cares
> about Mesos and its support or the focus moved to YARN?
>
> Many thanks,
> Petr
>


Re: How to parallel read files in a directory

2016-02-12 Thread Arkadiusz Bicz
Hi Junjie,

From my experience, HDFS is slow at reading a large number of small files, as
every file comes with a lot of metadata traffic to the namenode and datanodes.
When the file size is below the HDFS default block size (usually 64MB or
128MB), you cannot fully use Hadoop's optimizations for reading a lot of data
in a streamed way.

Also, when using DataFrames there is a huge overhead from caching file
information, as described in
https://issues.apache.org/jira/browse/SPARK-11441
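
Where the file layout cannot be changed upstream, one workaround is to read the whole
directory in a single pass and immediately shrink the partition count, so later stages
do not carry one tiny task per file. A minimal sketch, with a hypothetical path and
partition count:

// one partition per small file/block, then merged into fewer, larger partitions
val raw = sc.textFile("hdfs:///data/small-files/*")
val compacted = raw.coalesce(64)   // tune 64 to your cluster size
compacted.cache()
println(compacted.count())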

BR,
Arkadiusz Bicz
https://www.linkedin.com/in/arkadiuszbicz

On Thu, Feb 11, 2016 at 7:24 PM, Jakob Odersky  wrote:
> Hi Junjie,
>
> How do you access the files currently? Have you considered using hdfs? It's
> designed to be distributed across a cluster and Spark has built-in support.
>
> Best,
> --Jakob
>
> On Feb 11, 2016 9:33 AM, "Junjie Qian"  wrote:
>>
>> Hi all,
>>
>> I am working with Spark 1.6, scala and have a big dataset divided into
>> several small files.
>>
>> My question is: right now the read operation takes really long time and
>> often has RDD warnings. Is there a way I can read the files in parallel,
>> that all nodes or workers read the file at the same time?
>>
>> Many thanks
>> Junjie




Re: Inserting column to DataFrame

2016-02-12 Thread Zsolt Tóth
Sure. I ran the same job with fewer columns, the exception:

java.lang.IllegalArgumentException: requirement failed: DataFrame must
have the same schema as the relation to which is inserted.
DataFrame schema: StructType(StructField(pixel0,ByteType,true),
StructField(pixel1,ByteType,true), StructField(pixel10,ByteType,true),
StructField(pixel100,ShortType,true),
StructField(pixel101,ShortType,true),
StructField(pixel102,ShortType,true),
StructField(pixel103,ShortType,true),
StructField(pixel105,ShortType,true),
StructField(pixel106,ShortType,true), StructField(id,DoubleType,true),
StructField(label,ByteType,true),
StructField(predict,DoubleType,true))
Relation schema: StructType(StructField(pixel0,ByteType,true),
StructField(pixel1,ByteType,true), StructField(pixel10,ByteType,true),
StructField(pixel100,ShortType,true),
StructField(pixel101,ShortType,true),
StructField(pixel102,ShortType,true),
StructField(pixel103,ShortType,true),
StructField(pixel105,ShortType,true),
StructField(pixel106,ShortType,true), StructField(id,DoubleType,true),
StructField(label,ByteType,true),
StructField(predict,DoubleType,true))

at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)

Regards,

Zsolt


2016-02-12 13:11 GMT+01:00 Ted Yu :

> Can you pastebin the full error with all column types ?
>
> There should be a difference between some column(s).
>
> Cheers
>
> > On Feb 11, 2016, at 2:12 AM, Zsolt Tóth 
> wrote:
> >
> > Hi,
> >
> > I'd like to append a column of a dataframe to another DF (using Spark
> 1.5.2):
> >
> > DataFrame outputDF = unlabelledDF.withColumn("predicted_label",
> predictedDF.col("predicted"));
> >
> > I get the following exception:
> >
> > java.lang.IllegalArgumentException: requirement failed: DataFrame must
> have the same schema as the relation to which is inserted.
> > DataFrame schema:
> StructType(StructField(predicted_label,DoubleType,true), ... numerical (ByteType/ShortType) columns>
> > Relation schema:
> StructType(StructField(predicted_label,DoubleType,true), ... columns>
> >
> > The interesting part is that the two schemas in the exception are
> exactly the same.
> > The same code with other input data (with fewer, both numerical and
> non-numerical column) succeeds.
> > Any idea why this happens?
> >
>


Re: Inserting column to DataFrame

2016-02-12 Thread Zsolt Tóth
Hi,

Thanks for the answers. If joining the DataFrames is the solution, then why
does the simple withColumn() succeed for some datasets and fail for others?
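
For comparison, a sketch of the join-based approach suggested below, assuming both
DataFrames share an "id" column (the column names here are illustrative, not taken
from the failing job):

// Spark 1.5/1.6 DataFrame API
val outputDF = unlabelledDF
  .join(predictedDF.select("id", "predicted"), "id")
  .withColumnRenamed("predicted", "predicted_label")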

2016-02-11 11:53 GMT+01:00 Michał Zieliński :

> I think a good idea would be to do a join:
>
> outputDF = unlabelledDF.join(predictedDF.select(“id”,”predicted”),”id”)
>
> On 11 February 2016 at 10:12, Zsolt Tóth  wrote:
>
>> Hi,
>>
>> I'd like to append a column of a dataframe to another DF (using Spark
>> 1.5.2):
>>
>> DataFrame outputDF = unlabelledDF.withColumn("predicted_label",
>> predictedDF.col("predicted"));
>>
>> I get the following exception:
>>
>> java.lang.IllegalArgumentException: requirement failed: DataFrame must
>> have the same schema as the relation to which is inserted.
>> DataFrame schema:
>> StructType(StructField(predicted_label,DoubleType,true), ...> numerical (ByteType/ShortType) columns>
>> Relation schema: StructType(StructField(predicted_label,DoubleType,true),
>> ...
>>
>> The interesting part is that the two schemas in the exception are exactly
>> the same.
>> The same code with other input data (with fewer, both numerical and
>> non-numerical column) succeeds.
>> Any idea why this happens?
>>
>>
>


Using SPARK packages in Spark Cluster

2016-02-12 Thread Gourav Sengupta
Hi,

I am creating a SparkContext in a Spark standalone cluster as described here:
http://spark.apache.org/docs/latest/spark-standalone.html, using the
following code:

--
sc.stop()
conf = SparkConf().set( 'spark.driver.allowMultipleContexts' , False) \
  .setMaster("spark://hostname:7077") \
  .set('spark.shuffle.service.enabled', True) \
  .set('spark.dynamicAllocation.enabled','true') \
  .set('spark.executor.memory','20g') \
  .set('spark.driver.memory', '4g') \

.set('spark.default.parallelism',(multiprocessing.cpu_count() -1 ))
conf.getAll()
sc = SparkContext(conf = conf)

-(we should definitely be able to optimise the configuration but that
is not the point here) ---

Using this method, I am not able to use packages (a list of which is available
at http://spark-packages.org).

Whereas if I use the standard "pyspark --packages" option, the packages load
just fine.

I will be grateful if someone could kindly let me know how to load packages
when starting a cluster as mentioned above.


Regards,
Gourav Sengupta


Connection via JDBC to Oracle hangs after count call

2016-02-12 Thread Mich Talebzadeh
Hi,

 

I use the following to connect to Oracle DB from Spark shell 1.5.2

 

spark-shell --master spark://50.140.197.217:7077 --driver-class-path
/home/hduser/jars/ojdbc6.jar

 

in Scala I do

 

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)

sqlContext: org.apache.spark.sql.SQLContext =
org.apache.spark.sql.SQLContext@f9d4387

 

scala> val channels = sqlContext.read.format("jdbc").options(

 |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

 |  "dbtable" -> "(select * from sh.channels where channel_id =
14)",

 |  "user" -> "sh",

 |   "password" -> "xxx")).load

channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127),
CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID:
decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]

 

scala> channels.count()

 

But the latter command keeps hanging?

 

Any ideas appreciated

 

Thanks,

 

Mich Talebzadeh

 

LinkedIn

https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

 

  http://talebzadehmich.wordpress.com

 


 

 



Re: newbie unable to write to S3 403 forbidden error

2016-02-12 Thread Igor Berman
 String dirPath = "s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/json"

not sure, but
can you try to remove s3-us-west-1.amazonaws.com
 from path ?
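
In other words, the s3n:// scheme expects only a bucket and a key, so the hedged guess
is that the path should look like

    s3n://com.pws.twitter/json-<timestamp>

rather than embedding the regional endpoint host (which otherwise ends up being treated
as the bucket name).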

On 11 February 2016 at 23:15, Andy Davidson 
wrote:

> I am using spark 1.6.0 in a cluster created using the spark-ec2 script. I
> am using the standalone cluster manager
>
> My java streaming app is not able to write to s3. It appears to be some
> for of permission problem.
>
> Any idea what the problem might be?
>
> I tried use the IAM simulator to test the policy. Everything seems okay.
> Any idea how I can debug this problem?
>
> Thanks in advance
>
> Andy
>
> JavaSparkContext jsc = new JavaSparkContext(conf);
>
> // I did not include the full key in my email
>// the keys do not contain ‘\’
>// these are the keys used to create the cluster. They belong to
> the IAM user andy
>
> jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "AKIAJREX"
> );
>
> jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey",
> "uBh9v1hdUctI23uvq9qR");
>
>
>
>   private static void saveTweets(JavaDStream jsonTweets, String
> outputURI) {
>
> jsonTweets.foreachRDD(new VoidFunction2() {
>
> private static final long serialVersionUID = 1L;
>
>
> @Override
>
> public void call(JavaRDD rdd, Time time) throws
> Exception {
>
> if(!rdd.isEmpty()) {
>
> // bucket name is ‘com.pws.twitter’ it has a folder ‘json'
>
> String dirPath = "s3n://
> s3-us-west-1.amazonaws.com/com.pws.twitter/*json” *+ "-" + time
> .milliseconds();
>
> rdd.saveAsTextFile(dirPath);
>
> }
>
> }
>
> });
>
>
>
>
> Bucket name : com.pws.titter
> Bucket policy (I replaced the account id)
>
> {
> "Version": "2012-10-17",
> "Id": "Policy1455148808376",
> "Statement": [
> {
> "Sid": "Stmt1455148797805",
> "Effect": "Allow",
> "Principal": {
> "AWS": "arn:aws:iam::123456789012:user/andy"
> },
> "Action": "s3:*",
> "Resource": "arn:aws:s3:::com.pws.twitter/*"
> }
> ]
> }
>
>
>


Re: How to parallel read files in a directory

2016-02-12 Thread Jörn Franke
Put many small files in Hadoop Archives (HAR) to improve performance of reading 
small files. Alternatively have a batch job concatenating them.

> On 11 Feb 2016, at 18:33, Junjie Qian  wrote:
> 
> Hi all,
> 
> I am working with Spark 1.6, scala and have a big dataset divided into 
> several small files.
> 
> My question is: right now the read operation takes really long time and often 
> has RDD warnings. Is there a way I can read the files in parallel, that all 
> nodes or workers read the file at the same time?
> 
> Many thanks
> Junjie


Re: off-heap certain operations

2016-02-12 Thread Ted Yu
Ovidiu-Cristian:
Please see the following JIRA / PR :
[SPARK-12251] Document and improve off-heap memory configurations

Cheers

On Thu, Feb 11, 2016 at 11:06 PM, Sea <261810...@qq.com> wrote:

> spark.memory.offHeap.enabled (default is false) , it is wrong in spark
> docs. Spark1.6 do not recommend to use off-heap memory.
>
>
> -- Original message --
> *From:* "Ovidiu-Cristian MARCU";;
> *Sent:* Friday, February 12, 2016, 5:51 AM
> *To:* "user";
> *Subject:* off-heap certain operations
>
> Hi,
>
> Reading though the latest documentation for Memory management I can see
> that the parameter spark.memory.offHeap.enabled (true by default) is
> described with ‘If true, Spark will attempt to use off-heap memory for
> certain operations’ [1].
>
> Can you please describe the certain operations you are referring to?
>
> http://spark.apache.org/docs/latest/configuration.html#memory-management
>
> Thank!
>
> Best,
> Ovidiu
>


Re: Inserting column to DataFrame

2016-02-12 Thread Ted Yu
Seems like a bug.

Suggest filing an issue with code snippet if this can be reproduced on 1.6
branch.

Cheers

On Fri, Feb 12, 2016 at 4:25 AM, Zsolt Tóth 
wrote:

> Sure. I ran the same job with fewer columns, the exception:
>
> java.lang.IllegalArgumentException: requirement failed: DataFrame must have 
> the same schema as the relation to which is inserted.
> DataFrame schema: StructType(StructField(pixel0,ByteType,true), 
> StructField(pixel1,ByteType,true), StructField(pixel10,ByteType,true), 
> StructField(pixel100,ShortType,true), StructField(pixel101,ShortType,true), 
> StructField(pixel102,ShortType,true), StructField(pixel103,ShortType,true), 
> StructField(pixel105,ShortType,true), StructField(pixel106,ShortType,true), 
> StructField(id,DoubleType,true), StructField(label,ByteType,true), 
> StructField(predict,DoubleType,true))
> Relation schema: StructType(StructField(pixel0,ByteType,true), 
> StructField(pixel1,ByteType,true), StructField(pixel10,ByteType,true), 
> StructField(pixel100,ShortType,true), StructField(pixel101,ShortType,true), 
> StructField(pixel102,ShortType,true), StructField(pixel103,ShortType,true), 
> StructField(pixel105,ShortType,true), StructField(pixel106,ShortType,true), 
> StructField(id,DoubleType,true), StructField(label,ByteType,true), 
> StructField(predict,DoubleType,true))
>
>   at scala.Predef$.require(Predef.scala:233)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:113)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
>
> Regards,
>
> Zsolt
>
>
> 2016-02-12 13:11 GMT+01:00 Ted Yu :
>
>> Can you pastebin the full error with all column types ?
>>
>> There should be a difference between some column(s).
>>
>> Cheers
>>
>> > On Feb 11, 2016, at 2:12 AM, Zsolt Tóth 
>> wrote:
>> >
>> > Hi,
>> >
>> > I'd like to append a column of a dataframe to another DF (using Spark
>> 1.5.2):
>> >
>> > DataFrame outputDF = unlabelledDF.withColumn("predicted_label",
>> predictedDF.col("predicted"));
>> >
>> > I get the following exception:
>> >
>> > java.lang.IllegalArgumentException: requirement failed: DataFrame must
>> have the same schema as the relation to which is inserted.
>> > DataFrame schema:
>> StructType(StructField(predicted_label,DoubleType,true), ...> numerical (ByteType/ShortType) columns>
>> > Relation schema:
>> StructType(StructField(predicted_label,DoubleType,true), ...> columns>
>> >
>> > The interesting part is that the two schemas in the exception are
>> exactly the same.
>> > The same code with other input data (with fewer, both numerical and
>> non-numerical column) succeeds.
>> > Any idea why this happens?
>> >
>>
>
>


Re: Convert Iterable to RDD

2016-02-12 Thread seb.arzt
I have an Iterator of several million elements, which unfortunately won't fit
into the driver memory at the same time. I would like to save them as an
object file in HDFS:

Doing so I am running out of memory on the driver:

Using a stream

also won't work. I cannot further increase the driver memory. Why doesn't it
work out of the box? Shouldn't lazy evaluation and garbage collection
prevent the program from running out of memory? I could manually split the
Iterator into chunks and serialize each chunk, but it feels wrong. What is
going wrong here?
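
One way to avoid materialising the whole iterator on the driver (a sketch only, with a
placeholder chunk size and output path) is to parallelize it in bounded chunks, so only
one chunk is held in driver memory at a time:

val it: Iterator[String] = ???   // the large iterator
it.grouped(100000).zipWithIndex.foreach { case (chunk, i) =>
  // each chunk is a Seq small enough to fit on the driver
  sc.parallelize(chunk).saveAsObjectFile(s"hdfs:///out/chunk-$i")
}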




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Convert-Iterable-to-RDD-tp16882p26211.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




[SparkML] RandomForestModel save on disk.

2016-02-12 Thread Eugene Morozov
Hello,

I'm building a simple web service that works with Spark and allows users to
train a random forest model (mllib API) and use it for prediction. Trained
models are stored on the local file system (the web service and a Spark
cluster of just one worker run on the same machine).
I'm concerned about prediction performance and set up a small load
test to measure prediction latency. That's just initially; later I will set up
HDFS and a bigger Spark cluster.

At first I train 5 really small models (all of them can finish
within 30 seconds).
Next my perf testing framework waits for a minute and starts calling the
prediction method.

Sometimes I see that not all of the 5 models were saved on disk. There is a
metadata folder for them, but not the data directory that actually contains
the parquet files of the models.

I've looked through Spark's JIRA, but haven't found anything similar.
Has anyone experienced something like this?
Could you recommend where to look?
Might it be something with flushing it to disk immediately (just a wild
idea...)?

Thanks in advance.
--
Be well!
Jean Morozov
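
For context, the mllib save/load calls under discussion look roughly like this (paths
are placeholders); save is expected to write both the metadata directory and the
Parquet data directory, so a missing data directory suggests the save never completed:

import org.apache.spark.mllib.tree.model.RandomForestModel

// `model` is assumed to come from RandomForest.trainClassifier(...)
model.save(sc, "file:///models/rf-1")
val reloaded = RandomForestModel.load(sc, "file:///models/rf-1")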


Re: Convert Iterable to RDD

2016-02-12 Thread Jerry Lam
Not sure if I understand your problem well but why don't you create the file 
locally and then upload to hdfs?

Sent from my iPhone

> On 12 Feb, 2016, at 9:09 am, "seb.arzt"  wrote:
> 
> I have an Iterator of several million elements, which unfortunately won't fit
> into the driver memory at the same time. I would like to save them as object
> file in HDFS:
> 
> Doing so I am running out of memory on the driver:
> 
> Using a stream
> 
> also won't work. I cannot further increase the driver memory. Why doesn't it
> work out of the box? Shouldn't lazy evaluation and garbage collection
> prevent the program from running out of memory? I could manually split the
> Iterator into chunks and serialize each chunk, but it feels wrong. What is
> going wrong here?
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Convert-Iterable-to-RDD-tp16882p26211.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 




Python3 does not have Module 'UserString'

2016-02-12 Thread Sisyphuss
When trying the `reduceByKey` transformation on Python3.4, I got the
following error:

ImportError: No module named 'UserString'




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Python3-does-not-have-Module-UserString-tp26212.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-12 Thread p pathiyil
Thanks Sebastian.

I was indeed trying out FAIR scheduling with a high value for
concurrentJobs today.

It does improve the latency seen by the non-hot partitions, even if it does
not provide complete isolation. So it might be an acceptable middle ground.
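
For readers following along, a sketch of the combination being discussed, set on the
SparkConf before the streaming context is created (the value 4 is illustrative; as
noted, this does not give hard per-partition isolation):

val conf = new SparkConf()
  .set("spark.streaming.concurrentJobs", "4")   // allow batches/jobs to overlap
  .set("spark.scheduler.mode", "FAIR")          // share cores between concurrent jobs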
On 12 Feb 2016 12:18, "Sebastian Piu"  wrote:

> Have you tried using fair scheduler and queues
> On 12 Feb 2016 4:24 a.m., "p pathiyil"  wrote:
>
>> With this setting, I can see that the next job is being executed before
>> the previous one is finished. However, the processing of the 'hot'
>> partition eventually hogs all the concurrent jobs. If there was a way to
>> restrict jobs to be one per partition, then this setting would provide the
>> per-partition isolation.
>>
>> Is there anything in the framework which would give control over that
>> aspect ?
>>
>> Thanks.
>>
>>
>> On Thu, Feb 11, 2016 at 9:55 PM, Cody Koeninger 
>> wrote:
>>
>>> spark.streaming.concurrentJobs
>>>
>>>
>>> see e.g. 
>>> http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming
>>>
>>>
>>> On Thu, Feb 11, 2016 at 9:33 AM, p pathiyil  wrote:
>>>
 Thanks for the response Cody.

 The producers are out of my control, so can't really balance the
 incoming content across the various topics and partitions. The number of
 topics and partitions are quite large and the volume across then not very
 well known ahead of time. So it is quite hard to segregate low and high
 volume topics in to separate driver programs.

 Will look at shuffle / repartition.

 Could you share the setting for starting another batch in parallel ? It
 might be ok to call the 'save' of the processed messages out of order if
 that is the only consequence of this setting.

 When separate DStreams are created per partition (and if union() is not
 called on them), what aspect of the framework still ties the scheduling of
 jobs across the partitions together ? Asking this to see if creating
 multiple threads in the driver and calling createDirectStream per partition
 in those threads can provide isolation.



 On Thu, Feb 11, 2016 at 8:14 PM, Cody Koeninger 
 wrote:

> The real way to fix this is by changing partitioning, so you don't
> have a hot partition.  It would be better to do this at the time you're
> producing messages, but you can also do it with a shuffle / repartition
> during consuming.
>
> There is a setting to allow another batch to start in parallel, but
> that's likely to have unintended consequences.
>
> On Thu, Feb 11, 2016 at 7:59 AM, p pathiyil 
> wrote:
>
>> Hi,
>>
>> I am looking at a way to isolate the processing of messages from each
>> Kafka partition within the same driver.
>>
>> Scenario: A DStream is created with the createDirectStream call by
>> passing in a few partitions. Let us say that the streaming context is
>> defined to have a time duration of 2 seconds. If the processing of 
>> messages
>> from a single partition takes more than 2 seconds (while all the others
>> finish much quicker), it seems that the next set of jobs get scheduled 
>> only
>> after the processing of that last partition. This means that the delay is
>> effective for all partitions and not just the partition that was truly 
>> the
>> cause of the delay. What I would like to do is to have the delay only
>> impact the 'slow' partition.
>>
>> Tried to create one DStream per partition and then do a union of all
>> partitions, (similar to the sample in
>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-batch-processing-times),
>> but that didn't seem to help.
>>
>> Please suggest the correct approach to solve this issue.
>>
>> Thanks,
>> Praveen.
>>
>
>

>>>
>>


Re: best practices? spark streaming writing output detecting disk full error

2016-02-12 Thread Arkadiusz Bicz
Hi Andy,

I suggest monitoring disk usage and, in case it reaches 90% occupancy, sending
an alarm to your support team to solve the problem; you should not allow your
production system to go down.

Regarding tools, you can try a stack such as collectd and Spark ->
Graphite -> Grafana -> https://github.com/pabloa/grafana-alerts. I
have not used grafana-alerts but it looks promising.
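
On the Spark side, a minimal metrics sketch for pushing driver/executor metrics into
Graphite (host and port are placeholders), which Grafana can then chart and alert on:

# $SPARK_HOME/conf/metrics.properties
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds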

BR,

Arkadiusz Bicz


On Fri, Feb 12, 2016 at 4:38 PM, Andy Davidson
 wrote:
> Hi Arkadiusz
>
> Do you have any suggestions?
>
> As an engineer I think when I get disk full errors I want the application to
> terminate. It's a lot easier for ops to realize there is a problem.
>
>
> Andy
>
>
> From: Arkadiusz Bicz 
> Date: Friday, February 12, 2016 at 1:57 AM
> To: Andrew Davidson 
> Cc: "user @spark" 
> Subject: Re: best practices? spark streaming writing output detecting disk
> full error
>
> Hi,
>
> You need good monitoring tools to send you alarms about disk, network
> or  applications errors, but I think it is general dev ops work not
> very specific to spark or hadoop.
>
> BR,
>
> Arkadiusz Bicz
> https://www.linkedin.com/in/arkadiuszbicz
>
> On Thu, Feb 11, 2016 at 7:09 PM, Andy Davidson
>  wrote:
>
> We recently started a Spark/Spark Streaming POC. We wrote a simple streaming
> app in java to collect tweets. We choose twitter because we new we get a lot
> of data and probably lots of burst. Good for stress testing
>
> We spun up  a couple of small clusters using the spark-ec2 script. In one
> cluster we wrote all the tweets to HDFS in a second cluster we write all the
> tweets to S3
>
> We were surprised that our HDFS file system reached 100 % of capacity in a
> few days. This resulted with “all data nodes dead”. We where surprised
> because the actually stream app continued to run. We had no idea we had a
> problem until a day or two after the disk became full when we noticed we
> where missing a lot of data.
>
> We ran into a similar problem with our s3 cluster. We had a permission
> problem and where un able to write any data yet our stream app continued to
> run
>
>
> Spark generated mountains of logs,We are using the stand alone cluster
> manager. All the log levels wind up in the “error” log. Making it hard to
> find real errors and warnings using the web UI. Our app is written in Java
> so my guess is the write errors must be unable. I.E. We did not know in
> advance that they could occur . They are basically undocumented.
>
>
>
> We are a small shop. Running something like splunk would add a lot of
> expense and complexity for us at this stage of our growth.
>
> What are best practices
>
> Kind Regards
>
> Andy
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>




Re: Python3 does not have Module 'UserString'

2016-02-12 Thread Ted Yu
See this thread for discussion on related subject:
http://search-hadoop.com/m/q3RTtjkIOr1gHqFb1/dropping+spark+python+2.6=+discuss+dropping+Python+2+6+support

especially comments from Juliet.

On Fri, Feb 12, 2016 at 9:01 AM, Zheng Wendell 
wrote:

> I think this may be also due to the fact that I have multiple copies of
> Python.
> My driver program was using Python3.4.2
> My local slave nodes are using Python3.4.4 (System administrator's version)
>
> On Fri, Feb 12, 2016 at 5:51 PM, Zheng Wendell 
> wrote:
>
>> Sorry, I can no longer reproduce the error.
>> After upgrading Python3.4.2 to Python 3.4.4, the error disappears.
>>
>> Spark release: spark-1.6.0-bin-hadoop2.6
>> code snippet:
>> ```
>> lines = sc.parallelize([5,6,2,8,5,2,4,9,2,1,7,3,4,1,5,8,7,6])
>> pairs = lines.map(lambda x: (x, 1))
>> counts = pairs.reduceByKey(lambda a, b: a + b)
>> counts.collect()
>> ```
>>
>> On Fri, Feb 12, 2016 at 4:26 PM, Ted Yu  wrote:
>>
>>> Can you give a bit more information ?
>>>
>>> release of Spark you use
>>> full error trace
>>> your code snippet
>>>
>>> Thanks
>>>
>>> On Fri, Feb 12, 2016 at 7:22 AM, Sisyphuss 
>>> wrote:
>>>
 When trying the `reduceByKey` transformation on Python3.4, I got the
 following error:

 ImportError: No module named 'UserString'




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Python3-does-not-have-Module-UserString-tp26212.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Seperate Log4j.xml for Spark and Application JAR ( Application vs Spark )

2016-02-12 Thread Ashish Soni
Hi All ,

As per my best understanding, we can have only one log4j configuration for both
Spark and the application, as whichever comes first in the classpath takes
precedence. Is there any way we can keep one in the application and one in the
Spark conf folder? Is that possible?

Thanks
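
One commonly used workaround, sketched below with placeholder file names: ship a
dedicated log4j file to the executors with --files and point driver and executors at
different configurations through their extraJavaOptions, so the copy bundled in the
application jar is no longer the deciding factor:

spark-submit \
  --files /path/to/executor-log4j.xml \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/to/driver-log4j.xml" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=executor-log4j.xml" \
  ...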


Re: Python3 does not have Module 'UserString'

2016-02-12 Thread Zheng Wendell
Sorry, I can no longer reproduce the error.
After upgrading Python3.4.2 to Python 3.4.4, the error disappears.

Spark release: spark-1.6.0-bin-hadoop2.6
code snippet:
```
lines = sc.parallelize([5,6,2,8,5,2,4,9,2,1,7,3,4,1,5,8,7,6])
pairs = lines.map(lambda x: (x, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
counts.collect()
```

On Fri, Feb 12, 2016 at 4:26 PM, Ted Yu  wrote:

> Can you give a bit more information ?
>
> release of Spark you use
> full error trace
> your code snippet
>
> Thanks
>
> On Fri, Feb 12, 2016 at 7:22 AM, Sisyphuss  wrote:
>
>> When trying the `reduceByKey` transformation on Python3.4, I got the
>> following error:
>>
>> ImportError: No module named 'UserString'
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Python3-does-not-have-Module-UserString-tp26212.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


spark-submit: remote protocol vs --py-files

2016-02-12 Thread Jeff Henrikson
Spark users,

I am testing different cluster spinup and batch submission jobs.  Using the 
sequenceiq/spark docker package, I have succeeded in submitting "fat egg" 
(analogous to "fat jar") style python code remotely over YARN.  spark-submit 
--py-files is able to transmit the packaged code to the cluster and run it.

I had to read some source code to ascertain how to get the py-files feature 
working with remote submission.  I noticed that the code path for py-files over 
YARN protocol is considerably different from py-files over standalone protocol. 
 Is the py-files behavior the same over standalone submission protocol, 
compared with YARN submission protocol?

Regards,


Jeff Henrikson






Re: Python3 does not have Module 'UserString'

2016-02-12 Thread Zheng Wendell
I think this may be also due to the fact that I have multiple copies of
Python.
My driver program was using Python3.4.2
My local slave nodes are using Python3.4.4 (System administrator's version)

On Fri, Feb 12, 2016 at 5:51 PM, Zheng Wendell 
wrote:

> Sorry, I can no longer reproduce the error.
> After upgrading Python3.4.2 to Python 3.4.4, the error disappears.
>
> Spark release: spark-1.6.0-bin-hadoop2.6
> code snippet:
> ```
> lines = sc.parallelize([5,6,2,8,5,2,4,9,2,1,7,3,4,1,5,8,7,6])
> pairs = lines.map(lambda x: (x, 1))
> counts = pairs.reduceByKey(lambda a, b: a + b)
> counts.collect()
> ```
>
> On Fri, Feb 12, 2016 at 4:26 PM, Ted Yu  wrote:
>
>> Can you give a bit more information ?
>>
>> release of Spark you use
>> full error trace
>> your code snippet
>>
>> Thanks
>>
>> On Fri, Feb 12, 2016 at 7:22 AM, Sisyphuss 
>> wrote:
>>
>>> When trying the `reduceByKey` transformation on Python3.4, I got the
>>> following error:
>>>
>>> ImportError: No module named 'UserString'
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Python3-does-not-have-Module-UserString-tp26212.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Python3 does not have Module 'UserString'

2016-02-12 Thread Ted Yu
Can you give a bit more information ?

release of Spark you use
full error trace
your code snippet

Thanks

On Fri, Feb 12, 2016 at 7:22 AM, Sisyphuss  wrote:

> When trying the `reduceByKey` transformation on Python3.4, I got the
> following error:
>
> ImportError: No module named 'UserString'
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Python3-does-not-have-Module-UserString-tp26212.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


spark slave IP

2016-02-12 Thread Christopher Bourez
Dears,

is there a way to bind a slave to the public IP (instead of the private IP)

16/02/12 14:54:03 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20160212135403-0009/0 on hostPort 172.31.19.203:39841 with 2 cores,
1024.0 MB RAM

thanks,

C


Re: Spark Submit

2016-02-12 Thread Ashish Soni
it works as below

spark-submit --conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml" --conf
spark.executor.memory=512m

Thanks all for the quick help.



On Fri, Feb 12, 2016 at 10:59 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:

> Try
> spark-submit  --conf "spark.executor.memory=512m" --conf
> "spark.executor.extraJavaOptions=x" --conf "Dlog4j.configuration=log4j.xml"
>
> Sent from Samsung Mobile.
>
>
>  Original message 
> From: Ted Yu 
> Date:12/02/2016 21:24 (GMT+05:30)
> To: Ashish Soni 
> Cc: user 
> Subject: Re: Spark Submit
>
> Have you tried specifying multiple '--conf key=value' ?
>
> Cheers
>
> On Fri, Feb 12, 2016 at 7:44 AM, Ashish Soni 
> wrote:
>
>> Hi All ,
>>
>> How do i pass multiple configuration parameter while spark submit
>>
>> Please help i am trying as below
>>
>> spark-submit  --conf "spark.executor.memory=512m
>> spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml"
>>
>> Thanks,
>>
>
>


Spark Submit

2016-02-12 Thread Ashish Soni
Hi All ,

How do i pass multiple configuration parameter while spark submit

Please help i am trying as below

spark-submit  --conf "spark.executor.memory=512m
spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml"

Thanks,


Re: Spark Submit

2016-02-12 Thread Ted Yu
Have you tried specifying multiple '--conf key=value' ?

Cheers

On Fri, Feb 12, 2016 at 7:44 AM, Ashish Soni  wrote:

> Hi All ,
>
> How do i pass multiple configuration parameter while spark submit
>
> Please help i am trying as below
>
> spark-submit  --conf "spark.executor.memory=512m
> spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml"
>
> Thanks,
>


spark-shell throws JDBC error after load

2016-02-12 Thread Mich Talebzadeh
I have resolved the hanging issue below by using yarn-client as follows

 

spark-shell --master yarn --deploy-mode client --driver-class-path
/home/hduser/jars/ojdbc6.jar 

 

val channels = sqlContext.read.format("jdbc").options(

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(select * from sh.channels where channel_id = 14)",

"user" -> "sh",

"password" -> "sh")).load

channels.show

 

 

But I am getting this error with channels.show

 

 

16/02/12 16:03:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, rhes564, PROCESS_LOCAL, 1929 bytes)

16/02/12 16:03:37 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
on rhes564:33141 (size: 2.7 KB, free: 1589.8 MB)

16/02/12 16:03:38 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
rhes564): java.sql.SQLException: No suitable driver found for
jdbc:oracle:thin:@rhes564:1521:mydb

at java.sql.DriverManager.getConnection(DriverManager.java:596)

at java.sql.DriverManager.getConnection(DriverManager.java:187)

at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnecto
r$1.apply(JDBCRDD.scala:188)

at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnecto
r$1.apply(JDBCRDD.scala:181)

at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCR
DD.scala:360)

at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scal
a:352)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

at org.apache.spark.scheduler.Task.run(Task.scala:88)

at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11
45)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
15)

at java.lang.Thread.run(Thread.java:724)

 

16/02/12 16:03:38 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
1, rhes564, PROCESS_LOCAL, 1929 bytes)

16/02/12 16:03:38 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on
executor rhes564: java.sql.SQLException (No suitable driver found for
jdbc:oracle:thin:@rhes564:1521:mydb) [duplicate 1]

16/02/12 16:03:38 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID
2, rhes564, PROCESS_LOCAL, 1929 bytes)

16/02/12 16:03:38 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on
executor rhes564: java.sql.SQLException (No suitable driver found for
jdbc:oracle:thin:@rhes564:1521:mydb) [duplicate 2]

16/02/12 16:03:38 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID
3, rhes564, PROCESS_LOCAL, 1929 bytes)

16/02/12 16:03:38 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on
executor rhes564: java.sql.SQLException (No suitable driver found for
jdbc:oracle:thin:@rhes564:1521:mydb) [duplicate 3]

16/02/12 16:03:38 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job

16/02/12 16:03:38 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have
all completed, from pool

16/02/12 16:03:38 INFO YarnScheduler: Cancelling stage 0

16/02/12 16:03:38 INFO DAGScheduler: ResultStage 0 (show at :26)
failed in 1.182 s

16/02/12 16:03:38 INFO DAGScheduler: Job 0 failed: show at :26,
took 1.316319 s

16/02/12 16:03:39 INFO SparkContext: Invoking stop() from shutdown hook

16/02/12 16:03:39 INFO SparkUI: Stopped Spark web UI at
http://50.140.197.217:4040

16/02/12 16:03:39 INFO DAGScheduler: Stopping DAGScheduler

16/02/12 16:03:39 INFO YarnClientSchedulerBackend: Interrupting monitor
thread

16/02/12 16:03:39 INFO YarnClientSchedulerBackend: Shutting down all
executors

16/02/12 16:03:39 INFO YarnClientSchedulerBackend: Asking each executor to
shut down

16/02/12 16:03:39 INFO YarnClientSchedulerBackend: Stopped

 

Dr Mich Talebzadeh

 

LinkedIn

https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

 

  http://talebzadehmich.wordpress.com

 


Re: best practices? spark streaming writing output detecting disk full error

2016-02-12 Thread Andy Davidson
Hi Arkadiusz

Do you have any suggestions?

As an engineer I think when I get disk full errors I want the application to
terminate. It's a lot easier for ops to realize there is a problem.


Andy


From:  Arkadiusz Bicz 
Date:  Friday, February 12, 2016 at 1:57 AM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: best practices? spark streaming writing output detecting disk
full error

> Hi,
> 
> You need good monitoring tools to send you alarms about disk, network
> or  applications errors, but I think it is general dev ops work not
> very specific to spark or hadoop.
> 
> BR,
> 
> Arkadiusz Bicz
> https://www.linkedin.com/in/arkadiuszbicz
> 
> On Thu, Feb 11, 2016 at 7:09 PM, Andy Davidson
>  wrote:
>>  We recently started a Spark/Spark Streaming POC. We wrote a simple streaming
>>  app in java to collect tweets. We choose twitter because we new we get a lot
>>  of data and probably lots of burst. Good for stress testing
>> 
>>  We spun up  a couple of small clusters using the spark-ec2 script. In one
>>  cluster we wrote all the tweets to HDFS in a second cluster we write all the
>>  tweets to S3
>> 
>>  We were surprised that our HDFS file system reached 100 % of capacity in a
>>  few days. This resulted with "all data nodes dead". We where surprised
>>  because the actually stream app continued to run. We had no idea we had a
>>  problem until a day or two after the disk became full when we noticed we
>>  where missing a lot of data.
>> 
>>  We ran into a similar problem with our s3 cluster. We had a permission
>>  problem and where un able to write any data yet our stream app continued to
>>  run
>> 
>> 
>>  Spark generated mountains of logs,We are using the stand alone cluster
>>  manager. All the log levels wind up in the "error" log. Making it hard to
>>  find real errors and warnings using the web UI. Our app is written in Java
>>  so my guess is the write errors must be unable. I.E. We did not know in
>>  advance that they could occur . They are basically undocumented.
>> 
>> 
>> 
>>  We are a small shop. Running something like splunk would add a lot of
>>  expense and complexity for us at this stage of our growth.
>> 
>>  What are best practices
>> 
>>  Kind Regards
>> 
>>  Andy
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 




Re: Spark Submit

2016-02-12 Thread Jacek Laskowski
Or simply multiple -c.

Jacek
12.02.2016 4:54 PM "Ted Yu"  napisał(a):

> Have you tried specifying multiple '--conf key=value' ?
>
> Cheers
>
> On Fri, Feb 12, 2016 at 7:44 AM, Ashish Soni 
> wrote:
>
>> Hi All ,
>>
>> How do i pass multiple configuration parameter while spark submit
>>
>> Please help i am trying as below
>>
>> spark-submit  --conf "spark.executor.memory=512m
>> spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml"
>>
>> Thanks,
>>
>
>


Re: newbie unable to write to S3 403 forbidden error

2016-02-12 Thread Andy Davidson
Hi Igor

So I assume you are able to use s3 from spark?

Do you use rdd.saveAsTextFile() ?

How did you create your cluster? I.E. Did you use the spark-1.6.0/spark-ec2
script, EMR, or something else?


I tried several version of the url including no luck :-(

The bucket name is 'com.pws.twitter'. It has a folder 'json'.

We have a developer support contract with Amazon, however our case has been
unassigned for several days now.

Thanks

Andy

P.s. In general debugging permission problems is always difficult from the
client side. Secure servers do not want to make it easy for hackers

From:  Igor Berman 
Date:  Friday, February 12, 2016 at 4:53 AM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: newbie unable to write to S3 403 forbidden error

>  String dirPath = "s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/
>  json"
> 
> not sure, but 
> can you try to remove s3-us-west-1.amazonaws.com
>   from path ?
> 
> On 11 February 2016 at 23:15, Andy Davidson 
> wrote:
>> I am using spark 1.6.0 in a cluster created using the spark-ec2 script. I am
>> using the standalone cluster manager
>> 
>> My java streaming app is not able to write to s3. It appears to be some for
>> of permission problem.
>> 
>> Any idea what the problem might be?
>> 
>> I tried use the IAM simulator to test the policy. Everything seems okay. Any
>> idea how I can debug this problem?
>> 
>> Thanks in advance
>> 
>> Andy
>> 
>> JavaSparkContext jsc = new JavaSparkContext(conf);
>> 
>> 
>> // I did not include the full key in my email
>>// the keys do not contain '\'
>>// these are the keys used to create the cluster. They belong to the
>> IAM user andy
>> jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "AKIAJREX");
>> 
>> jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey",
>> "uBh9v1hdUctI23uvq9qR");
>> 
>> 
>> 
>> 
>>   private static void saveTweets(JavaDStream jsonTweets, String
>> outputURI) {
>> 
>> jsonTweets.foreachRDD(new VoidFunction2() {
>> 
>> private static final long serialVersionUID = 1L;
>> 
>> 
>> 
>> @Override
>> 
>> public void call(JavaRDD rdd, Time time) throws Exception
>> {
>> 
>> if(!rdd.isEmpty()) {
>> 
>> // bucket name is 'com.pws.twitter' it has a folder 'json'
>> 
>> String dirPath =
>> "s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/
>>  json" + "-" +
>> time.milliseconds();
>> 
>> rdd.saveAsTextFile(dirPath);
>> 
>> }
>> 
>> }
>> 
>> });
>> 
>> 
>> 
>> 
>> Bucket name : com.pws.titter
>> Bucket policy (I replaced the account id)
>> 
>> {
>> "Version": "2012-10-17",
>> "Id": "Policy1455148808376",
>> "Statement": [
>> {
>> "Sid": "Stmt1455148797805",
>> "Effect": "Allow",
>> "Principal": {
>> "AWS": "arn:aws:iam::123456789012:user/andy"
>> },
>> "Action": "s3:*",
>> "Resource": "arn:aws:s3:::com.pws.twitter/*"
>> }
>> ]
>> }
>> 
>> 
> 




Re: Spark Submit

2016-02-12 Thread Diwakar Dhanuskodi
Try 
spark-submit  --conf "spark.executor.memory=512m" --conf 
"spark.executor.extraJavaOptions=x" --conf "Dlog4j.configuration=log4j.xml"

Sent from Samsung Mobile.

 Original message From: Ted Yu 
 Date:12/02/2016  21:24  (GMT+05:30) 
To: Ashish Soni  Cc: user 
 Subject: Re: Spark Submit 
Have you tried specifying multiple '--conf key=value' ?

Cheers

On Fri, Feb 12, 2016 at 7:44 AM, Ashish Soni  wrote:
Hi All , 

How do i pass multiple configuration parameter while spark submit

Please help i am trying as below 

spark-submit  --conf "spark.executor.memory=512m 
spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml"

Thanks,



Re: [SparkML] RandomForestModel save on disk.

2016-02-12 Thread Eugene Morozov
Here is the exception I discover.

java.lang.RuntimeException: error reading Scala signature of
org.apache.spark.mllib.tree.model.DecisionTreeModel:
scala.reflect.internal.Symbols$PackageClassSymbol cannot be cast to
scala.reflect.internal.Constants$Constant
at
scala.reflect.internal.pickling.UnPickler.unpickle(UnPickler.scala:45)
~[scala-reflect-2.10.4.jar:na]
at
scala.reflect.runtime.JavaMirrors$JavaMirror.unpickleClass(JavaMirrors.scala:565)
~[scala-reflect-2.10.4.jar:na]
at
scala.reflect.runtime.SymbolLoaders$TopClassCompleter.complete(SymbolLoaders.scala:32)
~[scala-reflect-2.10.4.jar:na]
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
~[scala-reflect-2.10.4.jar:na]
at
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:43)
~[scala-reflect-2.10.4.jar:na]
at
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
~[scala-reflect-2.10.4.jar:na]
at
scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
~[scala-reflect-2.10.4.jar:na]
at
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
~[scala-reflect-2.10.4.jar:na]
at
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
~[scala-reflect-2.10.4.jar:na]
at
org.apache.spark.mllib.tree.model.TreeEnsembleModel$SaveLoadV1_0$$typecreator1$1.apply(treeEnsembleModels.scala:450)
~[spark-mllib_2.10-1.6.0.jar:1.6.0]
at
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
~[scala-reflect-2.10.4.jar:na]
at
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
~[scala-reflect-2.10.4.jar:na]
at
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:642)
~[spark-catalyst_2.10-1.6.0.jar:1.6.0]


--
Be well!
Jean Morozov

On Fri, Feb 12, 2016 at 5:57 PM, Eugene Morozov 
wrote:

> Hello,
>
> I'm building simple web service that works with spark and allows users to
> train random forest model (mlib API) and use it for prediction. Trained
> models are stored on the local file system (web service and spark of just
> one worker are run on the same machine).
> I'm concerned about prediction performance and established small load
> testing to measure prediction latency. That's initially, I will set up hdfs
> and bigger spark cluster.
>
> At first I run training 5 really small models (all of them can finish
> within 30 seconds).
> Next my perf testing framework waits for a minute and start calling
> prediction method.
>
> Sometimes I see that not all of the 5 models were saved on disk. There is
> a metadata folder for them, but not the data directory that actually
> contains parquet files of the models.
>
> I've looked through spark's jira, but haven't found anything similar.
> Has anyone experience smth like this?
> Could you recommend where to look for?
> Might it be something with flushing it to disk immediately (just a wild
> idea...)?
>
> Thanks in advance.
> --
> Be well!
> Jean Morozov
>


coalesce and executor memory

2016-02-12 Thread Christopher Brady
Can anyone help me understand why using coalesce causes my executors to 
crash with out of memory? What happens during coalesce that increases 
memory usage so much?


If I do:
hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile

everything works fine, but if I do:
hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile

my executors crash with out of memory exceptions.

Is there any documentation that explains what causes the increased 
memory requirements with coalesce? It seems to be less of a problem if I 
coalesce into a larger number of partitions, but I'm not sure why this 
is. How would I estimate how much additional memory the coalesce requires?


Thanks.
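
One general note (an observation about how coalesce works, not a diagnosis of this
particular job): coalesce without a shuffle merges several upstream partitions into a
single task, so each task now reads and caches several partitions' worth of data at
once. A sketch of the two variants to compare:

// default coalesce: no shuffle; upstream partitions are merged into fewer,
// larger tasks, so each task needs memory for several input splits at once
val merged = rdd.coalesce(16)

// repartition(n) == coalesce(n, shuffle = true): adds a shuffle stage, but the
// upstream tasks keep their original, smaller size
val reshuffled = rdd.repartition(16)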




SSE in s3

2016-02-12 Thread Lin, Hao
Hi,

Can we configure Spark to enable SSE (Server Side Encryption) for saving files 
to s3?

much appreciated!

thanks

Confidentiality Notice::  This email, including attachments, may include 
non-public, proprietary, confidential or legally privileged information.  If 
you are not an intended recipient or an authorized agent of an intended 
recipient, you are hereby notified that any dissemination, distribution or 
copying of the information contained in or transmitted with this e-mail is 
unauthorized and strictly prohibited.  If you have received this email in 
error, please notify the sender by replying to this message and permanently 
delete this e-mail, its attachments, and any copies of it immediately.  You 
should not retain, copy or use this e-mail or any attachment for any purpose, 
nor disclose all or any part of the contents to any other person. Thank you.


Re: off-heap certain operations

2016-02-12 Thread Ovidiu-Cristian MARCU
I found nothing about the "certain operations". It is still not clear; "certain" is
poor documentation. Can someone give an answer so I can consider using this new
release?
spark.memory.offHeap.enabled

If true, Spark will attempt to use off-heap memory for certain operations.

> On 12 Feb 2016, at 13:21, Ted Yu  wrote:
> 
> SP
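
For reference, a minimal sketch of enabling it, assuming Spark 1.6: the flag does
nothing by itself because spark.memory.offHeap.size defaults to 0, so the two
settings go together; the size below (in bytes) is only illustrative.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", (2L * 1024 * 1024 * 1024).toString)  // 2 GB, in bytes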



Allowing parallelism in spark local mode

2016-02-12 Thread yael aharon
Hello,
I have an application that receives requests over HTTP and uses spark in
local mode to process the requests. Each request is running in its own
thread.
It seems that spark is queueing the jobs, processing them one at a time.
When 2 requests arrive simultaneously, the processing time for each of them
is almost doubled.
I tried setting spark.default.parallelism, spark.executor.cores,
spark.driver.cores but that did not change the time in a meaningful way.

Am I missing something obvious?
thanks, Yael


Re: Using SPARK packages in Spark Cluster

2016-02-12 Thread Burak Yavuz
Hello Gourav,

The packages need to be loaded BEFORE you start the JVM, therefore you
won't be able to add packages dynamically in code. You should pass the
--packages option to pyspark when you start your application.
One option is to add a `conf` that will load some packages if you are
constantly going to use them.
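
A hedged illustration of both options; the spark-csv coordinate is only an example
package, and spark.jars.packages is the configuration equivalent of --packages:

# on the command line, before the JVM starts
pyspark --master spark://hostname:7077 --packages com.databricks:spark-csv_2.10:1.3.0

# or persistently, in conf/spark-defaults.conf
spark.jars.packages  com.databricks:spark-csv_2.10:1.3.0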

Best,
Burak



On Fri, Feb 12, 2016 at 4:22 AM, Gourav Sengupta 
wrote:

> Hi,
>
> I am creating sparkcontext in a SPARK standalone cluster as mentioned
> here: http://spark.apache.org/docs/latest/spark-standalone.html using the
> following code:
>
>
> --
> sc.stop()
> conf = SparkConf().set( 'spark.driver.allowMultipleContexts' , False) \
>   .setMaster("spark://hostname:7077") \
>   .set('spark.shuffle.service.enabled', True) \
>   .set('spark.dynamicAllocation.enabled','true') \
>   .set('spark.executor.memory','20g') \
>   .set('spark.driver.memory', '4g') \
>
> .set('spark.default.parallelism',(multiprocessing.cpu_count() -1 ))
> conf.getAll()
> sc = SparkContext(conf = conf)
>
> -(we should definitely be able to optimise the configuration but that
> is not the point here) ---
>
> I am not able to use packages, a list of which is mentioned here
> http://spark-packages.org, using this method.
>
> Where as if I use the standard "pyspark --packages" option then the
> packages load just fine.
>
> I will be grateful if someone could kindly let me know how to load
> packages when starting a cluster as mentioned above.
>
>
> Regards,
> Gourav Sengupta
>


RE: Question on Spark architecture and DAG

2016-02-12 Thread Mich Talebzadeh
Thanks Andy much appreciated

 

Mich Talebzadeh

 

LinkedIn

https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Technology Ltd, its subsidiaries nor their
employees accept any responsibility.

 

 

From: Andy Davidson [mailto:a...@santacruzintegration.com] 
Sent: 12 February 2016 21:17
To: Mich Talebzadeh ; user@spark.apache.org
Subject: Re: Question on Spark architecture and DAG

 

 

 

From: Mich Talebzadeh  >
Date: Thursday, February 11, 2016 at 2:30 PM
To: "user @spark"  >
Subject: Question on Spark architecture and DAG

 

Hi,

I have used Hive on the Spark engine (and of course Hive tables) and it's pretty
impressive compared to Hive using the MR engine.

 

Let us assume that I use spark shell. Spark shell is a client that connects
to spark master running on a host and port like below

spark-shell --master spark://50.140.197.217:7077:

Ok once I connect I create an RDD to read a text file:

val oralog = sc.textFile("/test/alert_mydb.log")

I then search for word Errors in that file

oralog.filter(line => line.contains("Errors")).collect().foreach(line =>
println(line))

 

Questions:

 

1.  In order to display the lines (the result set) containing word
"Errors", the content of the file (i.e. the blocks on HDFS) need to be read
into memory. Is my understanding correct that as per RDD notes those blocks
from the file will be partitioned across the cluster and each node will have
its share of blocks in memory?

 

 

Typically results are written to disk. For example look at
rdd.saveAsTextFile(). You can also use "collect" to copy the RDD data into
the driver's local memory. You need to be careful that all the data will fit
in memory.

 

2.  Once the result is returned back they need to be sent to the client
that has made the connection to master. I guess this is a simple TCP
operation much like any relational database sending the result back?

 

 

I run several spark streaming apps. One collects data, does some clean up
and publishes the results to down stream systems using activeMQ. Some of our
other apps just write on a socket

 

3.  Once the results are returned if no request has been made to keep
the data in memory, those blocks in memory will be discarded?

 

There are a couple of things to consider; for example, if your batch job
completes, all memory is returned. Programmatically you can make RDDs persistent
or cause them to be cached in memory

 

4.  Regardless of the storage block size on disk (128MB, 256MB etc), the
memory pages are 2K in relational databases? Is this the case in Spark as
well?

Thanks,

 Mich Talebzadeh

 

LinkedIn

https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Technology Ltd, its subsidiaries nor their
employees accept any responsibility.

 

 



Re: Allowing parallelism in spark local mode

2016-02-12 Thread Chris Fregly
sounds like the first job is occupying all resources.  you should limit the
resources that a single job can acquire.

fair scheduler is one way to do that.

a possibly simpler way is to configure spark.deploy.defaultCores or
spark.cores.max?

the defaults for these values - for the Spark default cluster resource
manager (aka Spark Standalone) - are infinite.  every job will try to
acquire every resource.

https://spark.apache.org/docs/latest/spark-standalone.html

here's an example config that i use for my reference data pipeline project:

https://github.com/fluxcapacitor/pipeline/blob/master/config/spark/spark-defaults.conf

i'm always playing with these values to simulate different conditions, but
that's the current snapshot that might be helpful.

also, don't forget about executor memory...
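
A hedged sketch of the within-application route (pool names and the job below are
illustrative): enable the FAIR scheduler, then tag each request-handling thread
with its own pool so its jobs are not queued FIFO behind everyone else's.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf()
  .setAppName("http-service")
  .setMaster("local[*]")
  .set("spark.scheduler.mode", "FAIR"))

def handleRequest(requestId: String): Long = {
  // each HTTP worker thread submits its jobs into its own pool
  sc.setLocalProperty("spark.scheduler.pool", s"pool-$requestId")
  try sc.parallelize(1 to 1000000).count()  // stand-in for the real per-request job
  finally sc.setLocalProperty("spark.scheduler.pool", null)
}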


On Fri, Feb 12, 2016 at 1:40 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> You’ll want to setup the FAIR scheduler as described here:
> https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
>
> From: yael aharon 
> Date: Friday, February 12, 2016 at 2:00 PM
> To: "user@spark.apache.org" 
> Subject: Allowing parallelism in spark local mode
>
> Hello,
> I have an application that receives requests over HTTP and uses spark in
> local mode to process the requests. Each request is running in its own
> thread.
> It seems that spark is queueing the jobs, processing them one at a time.
> When 2 requests arrive simultaneously, the processing time for each of them
> is almost doubled.
> I tried setting spark.default.parallelism, spark.executor.cores,
> spark.driver.cores but that did not change the time in a meaningful way.
>
> Am I missing something obvious?
> thanks, Yael
>
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com


GroupedDataset flatMapGroups with sorting (aka secondary sort redux)

2016-02-12 Thread Koert Kuipers
is there a way to leverage the shuffle in Dataset/GroupedDataset so that
Iterator[V] in flatMapGroups has a well defined ordering?

it is hard for me to see many good use cases for flatMapGroups and mapGroups
if you do not have sorting.

since spark has a sort based shuffle not exposing this would be a missed
opportunity, not unlike SPARK-3655
. And unlike with RDD
where this could be implemented in an external library without too much
trouble, i think with Dataset it is hard for a spark "user" to add this
functionality.


Dataset takes more memory compared to RDD

2016-02-12 Thread Raghava Mutharaju
Hello All,

I implemented an algorithm using both the RDDs and the Dataset API (in
Spark 1.6). The Dataset version takes a lot more memory than the RDDs. Is this
normal? Even for very small input data, it is running out of memory and I
get a java heap exception.

I tried the Kryo serializer by registering the classes and I
set spark.kryo.registrationRequired to true. I get the following exception

com.esotericsoftware.kryo.KryoException:
java.lang.IllegalArgumentException: Class is not registered:
org.apache.spark.sql.types.StructField[]
Note: To register this class use:
kryo.register(org.apache.spark.sql.types.StructField[].class);

I tried registering
using conf.registerKryoClasses(Array(classOf[StructField[]]))

But StructField[] does not exist. Is there any other way to register it? I
already registered StructField.

Regards,
Raghava.
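
A hedged sketch of registering the array class: in Scala the Java array type
StructField[] is written Array[StructField], so it can be registered like any
other class.

import org.apache.spark.SparkConf
import org.apache.spark.sql.types.StructField

val conf = new SparkConf()
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(
    classOf[StructField],
    classOf[Array[StructField]]  // this is org.apache.spark.sql.types.StructField[]
  ))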


_metada file throwing an "GC overhead limit exceeded" after a write

2016-02-12 Thread Maurin Lenglart
Hi,

I am currently using spark in python. I have my master, worker and driver on 
the same machine in different dockers. I am using spark 1.6.
The configuration that I am using look like this :

CONFIG["spark.executor.memory"] = "100g"
CONFIG["spark.executor.cores"] = "11"
CONFIG["spark.cores.max"] = "11"
CONFIG["spark.scheduler.mode"] = "FAIR"
CONFIG["spark.default.parallelism"] = “60"

I am doing a sql query and writing the result in one partitioned table.The code 
look like this :

df = self.sqlContext.sql(selectsql)
parquet_dir = self.dir_for_table(tablename)
df.write.partitionBy(partition_name).mode(mode).parquet(parquet_dir)

The code works and my partitions get created correctly. But it always
throws an exception (see below).

What makes me think that it is a problem with the _metadata file is that the
exception is thrown after the partition is created. And when I do an ls -ltr on
my folder, the _metadata file is the last one that gets modified and its size is
zero.

Any ideas why?

Thanks


The exception:


16/02/12 14:08:21 INFO ParseDriver: Parse Completed
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
Exception in thread "qtp1919278883-98" java.lang.OutOfMemoryError: GC overhead 
limit exceeded
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.addWaiter(AbstractQueuedSynchronizer.java:606)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at 
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at 
org.spark-project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:247)
at 
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:544)
at java.lang.Thread.run(Thread.java:745)
An error occurred while calling o57.parquet.
: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at 
org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:154)
at 
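
A hedged mitigation sketch: the _metadata / _common_metadata summary files are
assembled from the footers of every written file during job commit (the frame
ParquetMetadataConverter.addRowGroup in the trace above is part of building that
summary), which can exhaust memory when a table has many partitions and row
groups. If the summary files are not needed, they can be switched off through the
Hadoop configuration (the spark.hadoop. prefix passes the property through):

spark-submit ... --conf spark.hadoop.parquet.enable.summary-metadata=false ...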

pyspark.DataFrame.dropDuplicates

2016-02-12 Thread James Barney
Hi all,
Just wondering what the actual logic governing DataFrame.dropDuplicates()
is? For example:

>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
...     Row(name='Alice', age=5, height=80, itemsInPocket=['pen', 'pencil', 'paper']), \
...     Row(name='Alice', age=5, height=80, itemsInPocket=['pen', 'pencil', 'paper']), \
...     Row(name='Alice', age=10, height=80, itemsInPocket=['pen', 'pencil'])]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+--------------------+
|age|height| name|       itemsInPocket|
+---+------+-----+--------------------+
|  5|    80|Alice|[pen, pencil, paper]|
| 10|    80|Alice|       [pen, pencil]|
+---+------+-----+--------------------+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+--------------------+
|age|height| name|       itemsInPocket|
+---+------+-----+--------------------+
|  5|    80|Alice|[pen, pencil, paper]|
+---+------+-----+--------------------+
What determines which row is kept and which is deleted? First to appear? Or
random?

I would like to guarantee that the row with the longest list itemsInPocket
is kept. How can I do that?

Thanks,

James
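
As far as I know, which duplicate survives is not guaranteed. A hedged sketch of
making it deterministic, shown in the Scala DataFrame API (the same functions
exist in pyspark, and in Spark 1.6 window functions need a HiveContext): rank the
rows inside each (name, height) group by the length of the list and keep rank 1.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, size}

// df is the DataFrame built in the question above
val w = Window.partitionBy("name", "height")
              .orderBy(size(col("itemsInPocket")).desc)

val keepLongest = df.withColumn("rn", row_number().over(w))
                    .filter(col("rn") === 1)
                    .drop("rn")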


Re: Allowing parallelism in spark local mode

2016-02-12 Thread Silvio Fiorito
You’ll want to setup the FAIR scheduler as described here: 
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

From: yael aharon >
Date: Friday, February 12, 2016 at 2:00 PM
To: "user@spark.apache.org" 
>
Subject: Allowing parallelism in spark local mode

Hello,
I have an application that receives requests over HTTP and uses spark in local 
mode to process the requests. Each request is running in its own thread.
It seems that spark is queueing the jobs, processing them one at a time. When 2 
requests arrive simultaneously, the processing time for each of them is almost 
doubled.
I tried setting spark.default.parallelism, spark.executor.cores, 
spark.driver.cores but that did not change the time in a meaningful way.

Am I missing something obvious?
thanks, Yael



Re: coalesce and executor memory

2016-02-12 Thread Silvio Fiorito
Coalesce essentially reduces parallelism, so fewer cores are getting more 
records. Be aware that it could also lead to loss of data locality, depending 
on how far you reduce. Depending on what you’re doing in the map operation, it 
could lead to OOM errors. Can you give more details as to what the code for the 
map looks like?




On 2/12/16, 1:13 PM, "Christopher Brady"  wrote:

>Can anyone help me understand why using coalesce causes my executors to 
>crash with out of memory? What happens during coalesce that increases 
>memory usage so much?
>
>If I do:
>hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
>
>everything works fine, but if I do:
>hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile
>
>my executors crash with out of memory exceptions.
>
>Is there any documentation that explains what causes the increased 
>memory requirements with coalesce? It seems to be less of a problem if I 
>coalesce into a larger number of partitions, but I'm not sure why this 
>is. How would I estimate how much additional memory the coalesce requires?
>
>Thanks.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>


Re: coalesce and executor memory

2016-02-12 Thread Koert Kuipers
in spark, every partition needs to fit in the memory available to the core
processing it.

as you coalesce you reduce number of partitions, increasing partition size.
at some point the partition no longer fits in memory.
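
A small sketch of the difference (numbers are illustrative): plain coalesce merges
existing partitions without a shuffle, so each surviving partition holds more
records, while coalesce with shuffle = true (which is what repartition does)
redistributes the records evenly at the cost of a shuffle.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("coalesce-sketch").setMaster("local[4]"))
val data = sc.parallelize(1 to 1000000, 200)        // 200 small partitions

val merged     = data.coalesce(8)                   // 8 much larger partitions, no shuffle
val reshuffled = data.coalesce(8, shuffle = true)   // equivalent to data.repartition(8)

println((merged.partitions.length, reshuffled.partitions.length))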

On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> Coalesce essentially reduces parallelism, so fewer cores are getting more
> records. Be aware that it could also lead to loss of data locality,
> depending on how far you reduce. Depending on what you’re doing in the map
> operation, it could lead to OOM errors. Can you give more details as to
> what the code for the map looks like?
>
>
>
>
> On 2/12/16, 1:13 PM, "Christopher Brady" 
> wrote:
>
> >Can anyone help me understand why using coalesce causes my executors to
> >crash with out of memory? What happens during coalesce that increases
> >memory usage so much?
> >
> >If I do:
> >hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
> >
> >everything works fine, but if I do:
> >hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile
> >
> >my executors crash with out of memory exceptions.
> >
> >Is there any documentation that explains what causes the increased
> >memory requirements with coalesce? It seems to be less of a problem if I
> >coalesce into a larger number of partitions, but I'm not sure why this
> >is. How would I estimate how much additional memory the coalesce requires?
> >
> >Thanks.
> >
> >-
> >To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Re: Question on Spark architecture and DAG

2016-02-12 Thread Andy Davidson


From:  Mich Talebzadeh 
Date:  Thursday, February 11, 2016 at 2:30 PM
To:  "user @spark" 
Subject:  Question on Spark architecture and DAG

> Hi,
> 
> I have used Hive on the Spark engine (and of course Hive tables) and it's pretty
> impressive compared to Hive using the MR engine.
> 
>  
> 
> Let us assume that I use spark shell. Spark shell is a client that connects to
> spark master running on a host and port like below
> 
> spark-shell --master spark://50.140.197.217:7077:
> 
> Ok once I connect I create an RDD to read a text file:
> 
> val oralog = sc.textFile("/test/alert_mydb.log")
> 
> I then search for word Errors in that file
> 
> oralog.filter(line => line.contains("Errors")).collect().foreach(line =>
> println(line))
> 
>  
> 
> Questions:
> 
>  
> 1. In order to display the lines (the result set) containing word "Errors",
> the content of the file (i.e. the blocks on HDFS) need to be read into memory.
> Is my understanding correct that as per RDD notes those blocks from the file
> will be partitioned across the cluster and each node will have its share of
> blocks in memory?


Typically results are written to disk. For example look at
rdd.saveAsTextFile(). You can also use "collect" to copy the RDD data into
the driver's local memory. You need to be careful that all the data will fit
in memory.

> 2. Once the result is returned back they need to be sent to the client that
> has made the connection to master. I guess this is a simple TCP operation much
> like any relational database sending the result back?


I run several spark streaming apps. One collects data, does some clean up
and publishes the results to down stream systems using activeMQ. Some of our
other apps just write on a socket

> 3. Once the results are returned if no request has been made to keep the data
> in memory, those blocks in memory will be discarded?

There are a couple of things to consider; for example, if your batch job
completes, all memory is returned. Programmatically you can make RDDs persistent
or cause them to be cached in memory
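
To make the three paths concrete, a small sketch built on the spark-shell example
from the original mail (the output path is illustrative):

val oralog = sc.textFile("/test/alert_mydb.log")
val errors = oralog.filter(_.contains("Errors"))

errors.saveAsTextFile("/test/alert_errors")  // results go back to HDFS, not to the driver
val local = errors.collect()                 // copies every matching line into the driver JVM
errors.persist()                             // mark for caching; populated by the next action
println(errors.count())                      // this action populates the cache; later actions reuse it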

> 4. Regardless of the storage block size on disk (128MB, 256MB etc), the memory
> pages are 2K in relational databases? Is this the case in Spark as well?
> Thanks,
> 
>  Mich Talebzadeh
> 
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com 
>  
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this message
> shall not be understood as given or endorsed by Peridale Technology Ltd, its
> subsidiaries or their employees, unless expressly so stated. It is the
> responsibility of the recipient to ensure that this email is virus free,
> therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>  
>  




Spark with DF throws No suitable driver found for jdbc:oracle: after first call

2016-02-12 Thread Mich Talebzadeh
First I put the Oracle JAR file in spark-shell start up and also in
CLASSPATH

 

spark-shell --master yarn --deploy-mode client --driver-class-path
/home/hduser/jars/ojdbc6.jar

 

Now it shows clearly that the load call is successful, as shown below, so it can
use the driver. However, the next method call, channels.collect, throws an error!

 

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)

sqlContext: org.apache.spark.sql.SQLContext =
org.apache.spark.sql.SQLContext@33fb6c08

 

scala> val channels = sqlContext.read.format("jdbc").options(

 | Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

 | "dbtable" -> "(select * from sh.channels where channel_id = 14)",

 | "user" -> "sh",

 | "password" -> "sh")).load

channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127),
CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID:
decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]

 

scala> channels.collect

16/02/12 22:48:19 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
(TID 3, rhes564): java.sql.SQLException: No suitable driver found for
jdbc:oracle:thin:@rhes564:1521:mydb
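
The trace shows the failure on an executor (rhes564), not in the driver JVM, which
suggests the jar reached the driver classpath but not the executors. A hedged
sketch of the usual fixes: ship the jar to the executors as well (for example
--jars /home/hduser/jars/ojdbc6.jar in addition to --driver-class-path) and name
the driver class explicitly so DriverManager does not have to discover it (the
class name below is an assumption about what ojdbc6.jar provides):

val channels = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:oracle:thin:@rhes564:1521:mydb",
  "driver"   -> "oracle.jdbc.OracleDriver",  // assumed driver class shipped in ojdbc6.jar
  "dbtable"  -> "(select * from sh.channels where channel_id = 14)",
  "user"     -> "sh",
  "password" -> "sh")).load()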

 

Dr Mich Talebzadeh

 

LinkedIn

https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Technology Ltd, its subsidiaries nor their
employees accept any responsibility.

 

 



new to Spark - trying to get a basic example to run - could use some help

2016-02-12 Thread Taylor, Ronald C
Hello folks,

This is my first msg to the list. New to Spark, and trying to run the SparkPi 
example shown in the Cloudera documentation.  We have Cloudera 5.5.1 running on 
a small cluster at our lab, with Spark 1.5.

My trial invocation is given below. The output that I get *says* that I 
"SUCCEEDED" at the end. But - I don't get any screen output on the value of pi. 
I also tried a SecondarySort Spark program  that I compiled and jarred from Dr. 
Parsian's Data Algorithms book. That program  failed. So - I am focusing on 
getting SparkPi to work properly, to get started. Can somebody look at the 
screen output that I cut-and-pasted below and infer what I might be doing wrong?

Am I forgetting to set one or more environment variables? Or not setting such 
properly?

Here is the CLASSPATH value that I set:

CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils

Here is the settings of other environment variables:

HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11
SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11
HADOOP_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar'
SPARK_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar:':'/people/rtaylor/SparkWork/DataAlgUtils'

I am not sure that those env vars are properly set (or if even all of them are 
needed). But that's what I'm currently using.

As I said, the invocation below appears to terminate with final status set to 
"SUCCEEDED". But - there is no screen output on the value of pi, which I 
understood would be shown. So - something appears to be going wrong. I went to 
the tracking URL given at the end, but could not access it.

I would very much appreciate some guidance!

-   Ron Taylor

%

INVOCATION:

[rtaylor@bigdatann]$ spark-submit   --class org.apache.spark.examples.SparkPi   
 --master yarn --deploy-mode cluster --name RT_SparkPi 
/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
10

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/livy-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/pig-0.12.0-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/02/12 18:16:59 INFO client.RMProxy: Connecting to ResourceManager at 
bigdatann.ib/172.17.115.18:8032
16/02/12 18:16:59 INFO yarn.Client: Requesting a new application from cluster 
with 15 NodeManagers
16/02/12 18:16:59 INFO yarn.Client: Verifying our application has not requested 
more than the maximum memory capability of the cluster (65536 MB per container)
16/02/12 18:16:59 INFO yarn.Client: Will allocate AM container, with 1408 MB 
memory including 384 MB overhead
16/02/12 18:16:59 INFO yarn.Client: Setting up container launch context for our 
AM
16/02/12 18:16:59 INFO yarn.Client: Setting up the launch environment for our 
AM container
16/02/12 18:16:59 INFO yarn.Client: Preparing resources for our AM container
16/02/12 18:17:00 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/02/12 18:17:00 INFO yarn.Client: Uploading resource 
file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
 -> 
hdfs://bigdatann.ib:8020/user/rtaylor/.sparkStaging/application_1454115464826_0070/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
16/02/12 18:17:21 INFO yarn.Client: Uploading resource 
file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
 -> 
hdfs://bigdatann.ib:8020/user/rtaylor/.sparkStaging/application_1454115464826_0070/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
16/02/12 18:17:23 INFO yarn.Client: Uploading resource 
file:/tmp/spark-141bf8a4-2f4b-49d3-b041-61070107e4de/__spark_conf__8357851336386157291.zip
 -> 
hdfs://bigdatann.ib:8020/user/rtaylor/.sparkStaging/application_1454115464826_0070/__spark_conf__8357851336386157291.zip
16/02/12 18:17:23 INFO spark.SecurityManager: Changing view acls to: rtaylor
16/02/12 18:17:23 INFO spark.SecurityManager: 
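
One thing worth checking: with --deploy-mode cluster the driver (and the "Pi is
roughly ..." line it prints) runs inside a YARN container, so the value never
appears on the submitting terminal. A hedged sketch of where to look, using the
application id from the log above:

yarn logs -applicationId application_1454115464826_0070

# or re-run with --deploy-mode client to have the result printed on the console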

Re: coalesce and executor memory

2016-02-12 Thread Christopher Brady
Thank you for the responses. The map function just changes the format of 
the record slightly, so I don't think that would be the cause of the 
memory problem.


So if I have 3 cores per executor, I need to be able to fit 3 partitions 
per executor within whatever I specify for the executor memory? Is there 
a way I can programmatically find a number of partitions I can coalesce 
down to without running out of memory? Is there some documentation where 
this is explained?



On 02/12/2016 05:10 PM, Koert Kuipers wrote:
in spark, every partition needs to fit in the memory available to the 
core processing it.


as you coalesce you reduce number of partitions, increasing partition 
size. at some point the partition no longer fits in memory.


On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito 
> 
wrote:


Coalesce essentially reduces parallelism, so fewer cores are
getting more records. Be aware that it could also lead to loss of
data locality, depending on how far you reduce. Depending on what
you’re doing in the map operation, it could lead to OOM errors.
Can you give more details as to what the code for the map looks like?




On 2/12/16, 1:13 PM, "Christopher Brady"
> wrote:

>Can anyone help me understand why using coalesce causes my
executors to
>crash with out of memory? What happens during coalesce that increases
>memory usage so much?
>
>If I do:
>hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
>
>everything works fine, but if I do:
>hadoopFile -> sample -> coalesce -> cache -> map ->
saveAsNewAPIHadoopFile
>
>my executors crash with out of memory exceptions.
>
>Is there any documentation that explains what causes the increased
>memory requirements with coalesce? It seems to be less of a
problem if I
>coalesce into a larger number of partitions, but I'm not sure why
this
>is. How would I estimate how much additional memory the coalesce
requires?
>
>Thanks.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

>For additional commands, e-mail: user-h...@spark.apache.org

>






org.apache.spark.sql.AnalysisException: undefined function lit;

2016-02-12 Thread Andy Davidson
I am trying to add a column with a constant value to my data frame. Any idea
what I am doing wrong?

Kind regards

Andy


 DataFrame result = ...
 String exprStr = "lit(" + time.milliseconds()+ ") as ms";

 logger.warn("AEDWIP expr: {}", exprStr);

  result.selectExpr("*", exprStr).show(false);


WARN  02:06:17 streaming-job-executor-0 c.p.f.s.s.CalculateAggregates$1 call
line:96 AEDWIP expr: lit(1455329175000) as ms

ERROR 02:06:17 JobScheduler o.a.s.Logging$class logError line:95 Error
running job streaming job 1455329175000 ms.0

org.apache.spark.sql.AnalysisException: undefined function lit;
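
A hedged sketch of a workaround, shown in Scala (the Java API exposes the same
org.apache.spark.sql.functions.lit): selectExpr strings go through the SQL parser,
where lit is apparently not registered as a function, so add the constant column
through withColumn instead, or embed the value as a plain SQL literal.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def addBatchTime(df: DataFrame, ms: Long): DataFrame =
  df.withColumn("ms", lit(ms))          // constant column via the DataFrame DSL

def addBatchTimeSql(df: DataFrame, ms: Long): DataFrame =
  df.selectExpr("*", s"$ms as ms")      // or write the literal directly into the SQL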








Re: Computing hamming distance over large data set

2016-02-12 Thread Charlie Hack
I ran across DIMSUM a while ago but never used it.

https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

Annoy is wonderful if you want to make queries.

If you want to do the "self similarity join" you might look at DIMSUM, or
preferably, if at all possible, see if there's some key on which you can join
candidate pairs and then use a similarity metric to filter out non-matches.
Does that make sense? In general that's way more efficient than computing n^2
similarities.

Hth

Charlie
On Fri, Feb 12, 2016 at 20:57 Maciej Szymkiewicz 
wrote:

> There is also this: https://github.com/soundcloud/cosine-lsh-join-spark
>
>
> On 02/11/2016 10:12 PM, Brian Morton wrote:
>
> Karl,
>
> This is tremendously useful.  Thanks very much for your insight.
>
> Brian
>
> On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley  wrote:
>
>> Hi,
>>
>> It sounds like you're trying to solve the approximate nearest neighbor
>> (ANN) problem. With a large dataset, parallelizing a brute force O(n^2)
>> approach isn't likely to help all that much, because the number of pairwise
>> comparisons grows quickly as the size of the dataset increases. I'd look at
>> ways to avoid computing the similarity between all pairs, like
>> locality-sensitive hashing. (Unfortunately Spark doesn't yet support LSH --
>> it's currently slated for the Spark 2.0.0 release, but AFAIK development on
>> it hasn't started yet.)
>>
>> There are a bunch of Python libraries that support various approaches to
>> the ANN problem (including LSH), though. It sounds like you need fast
>> lookups, so you might check out https://github.com/spotify/annoy. For
>> other alternatives, see this performance comparison of Python ANN libraries
>> : https://github.com/erikbern/ann-benchmarks.
>>
>> Hope that helps,
>> Karl
>>
>> On Wed, Feb 10, 2016 at 10:29 PM rokclimb15  wrote:
>>
>>> Hi everyone, new to this list and Spark, so I'm hoping someone can point
>>> me
>>> in the right direction.
>>>
>>> I'm trying to perform this same sort of task:
>>>
>>> http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>>>
>>> and I'm running into the same problem - it doesn't scale.  Even on a very
>>> fast processor, MySQL pegs out one CPU core at 100% and takes 8 hours to
>>> find a match with 30 million+ rows.
>>>
>>> What I would like to do is to load this data set from MySQL into Spark
>>> and
>>> compute the Hamming distance using all available cores, then select the
>>> rows
>>> matching a maximum distance.  I'm most familiar with Python, so would
>>> prefer
>>> to use that.
>>>
>>> I found an example of loading data from MySQL
>>>
>>>
>>> http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>>>
>>> I found a related DataFrame commit and docs, but I'm not exactly sure
>>> how to
>>> put this all together.
>>>
>>>
>>> https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>>>
>>>
>>> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
>>>
>>> Could anyone please point me to a similar example I could follow as a
>>> Spark
>>> newb to try this out?  Is this even worth attempting, or will it
>>> similarly
>>> fail performance-wise?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Computing-hamming-distance-over-large-data-set-tp26202.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>
> --
> Maciej Szymkiewicz
>
>


How to write Array[Byte] as JPG file in Spark?

2016-02-12 Thread Liangzhao Zeng
Hello All

I have RDD[(id:String, image:Array[Byte])] and would like to write the
image attribute as a jpg file into HDFS. Any suggestions?

Cheers,

LZ
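
A hedged sketch, assuming the byte arrays already hold valid JPEG data: write each
record to its own HDFS file from the executors through the Hadoop FileSystem API
(the output directory is illustrative).

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

def saveImages(images: RDD[(String, Array[Byte])], outputDir: String): Unit =
  images.foreachPartition { records =>
    val fs = FileSystem.get(new Configuration())  // picks up the cluster's HDFS settings
    records.foreach { case (id, bytes) =>
      val out = fs.create(new Path(s"$outputDir/$id.jpg"))
      try out.write(bytes) finally out.close()
    }
  }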


Re: Computing hamming distance over large data set

2016-02-12 Thread Maciej Szymkiewicz
There is also this: https://github.com/soundcloud/cosine-lsh-join-spark

On 02/11/2016 10:12 PM, Brian Morton wrote:
> Karl,
>
> This is tremendously useful.  Thanks very much for your insight.
>
> Brian
>
> On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley  > wrote:
>
> Hi,
>
> It sounds like you're trying to solve the approximate nearest
> neighbor (ANN) problem. With a large dataset, parallelizing a
> brute force O(n^2) approach isn't likely to help all that much,
> because the number of pairwise comparisons grows quickly as the
> size of the dataset increases. I'd look at ways to avoid computing
> the similarity between all pairs, like locality-sensitive hashing.
> (Unfortunately Spark doesn't yet support LSH -- it's currently
> slated for the Spark 2.0.0 release, but AFAIK development on it
> hasn't started yet.)
>
> There are a bunch of Python libraries that support various
> approaches to the ANN problem (including LSH), though. It sounds
> like you need fast lookups, so you might check out
> https://github.com/spotify/annoy. For other alternatives, see this
> performance comparison of Python ANN
> libraries: https://github.com/erikbern/ann-benchmarks.
>
> Hope that helps,
> Karl
>
> On Wed, Feb 10, 2016 at 10:29 PM rokclimb15  > wrote:
>
> Hi everyone, new to this list and Spark, so I'm hoping someone
> can point me
> in the right direction.
>
> I'm trying to perform this same sort of task:
> 
> http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>
> and I'm running into the same problem - it doesn't scale. 
> Even on a very
> fast processor, MySQL pegs out one CPU core at 100% and takes
> 8 hours to
> find a match with 30 million+ rows.
>
> What I would like to do is to load this data set from MySQL
> into Spark and
> compute the Hamming distance using all available cores, then
> select the rows
> matching a maximum distance.  I'm most familiar with Python,
> so would prefer
> to use that.
>
> I found an example of loading data from MySQL
>
> 
> http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>
> I found a related DataFrame commit and docs, but I'm not
> exactly sure how to
> put this all together.
>
> 
> https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>
> 
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
>
> Could anyone please point me to a similar example I could
> follow as a Spark
> newb to try this out?  Is this even worth attempting, or will
> it similarly
> fail performance-wise?
>
> Thanks!
>
>
>
> --
> View this message in context:
> 
> http://apache-spark-user-list.1001560.n3.nabble.com/Computing-hamming-distance-over-large-data-set-tp26202.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> 
> For additional commands, e-mail: user-h...@spark.apache.org
> 
>
>

-- 
Maciej Szymkiewicz





Sharing temporary table

2016-02-12 Thread max.tenerowicz
This video suggests that
registerTempTable can be used to share a table between sessions. Is this a
Databricks-platform-specific feature, or can one do something like this in
general?

Best, 
Max



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sharing-temporary-table-tp26214.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Dataset GroupedDataset.reduce

2016-02-12 Thread Koert Kuipers
i see that currently GroupedDataset.reduce simply calls flatMapGroups. does
this mean that there is currently no partial aggregation for reduce?


Spark jobs run extremely slow on yarn cluster compared to standalone spark

2016-02-12 Thread pdesai
Hi there,

I am doing a POC with Spark and I have noticed that if I run my job on a
standalone Spark installation, it finishes in a second (it's a small sample
job). But when I run the same job on a Spark cluster with YARN, it takes 4-5
minutes for a simple execution.
Are there any best practices that I need to follow for Spark cluster
configuration? I have left all default settings. During spark-submit I
specify num-executors=3, executor-memory=512m, executor-cores=1.

I am using Java Spark SQL API.

Thanks,
Purvi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jobs-run-extremely-slow-on-yarn-cluster-compared-to-standalone-spark-tp26215.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



support vector machine does not classify properly?

2016-02-12 Thread prem09
Hi,
I created a dataset of 100 points, ranging from X=1.0 to to X=100.0. I let
the y variable be 0.0 if X < 51.0 and 1.0 otherwise. I then fit a
SVMwithSGD. When I predict the y values for the same values of X as in the
sample, I get back 1.0 for each predicted y! 

Incidentally, I don't get perfect separation when I replace SVMwithSGD with
LogisticRegressionWithSGD or NaiveBayes.

Here's the code:


import sys
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD,
LogisticRegressionModel
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.mllib.regression import LabeledPoint
import numpy as np

# Load a text file and convert each line to a tuple.
sc=SparkContext(appName="Prem")

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split('\t')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("c:/python27/classifier.txt")
parsedData = data.map(parsePoint)
print parsedData

# Build the model
model = SVMWithSGD.train(parsedData, iterations=100)
model.setThreshold(0.5)
print model

### Build the model
##model = LogisticRegressionWithSGD.train(parsedData, iterations=100,
intercept=True)
##print model

### Build the model
##model = NaiveBayes.train(parsedData)
##print model

for i in range(100):
    print i+1, model.predict(np.array([float(i+1)]))

=

Incidentally, the weights I observe in MLlib are 0.8949991, while if I run
it using the scikit-learn library version of support vector machine, I get
0.05417109. Is this indicative of the problem?
Can you please let me know what I am doing wrong?

Thanks,
Prem



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/support-vector-machine-does-not-classify-properly-tp26216.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org