Re: Mllib Logistic Regression performance relative to Mahout

2016-02-28 Thread Yashwanth Kumar
Hi,
If your features are numeric, try feature scaling before feeding them to Spark
Logistic Regression; it might improve the accuracy.
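
For reference, here is a minimal sketch of that with the RDD-based MLlib API
(the input path and the binary-classification setup are just assumptions, not
from the original question):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Load labeled points (the path is a placeholder)
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Fit a scaler on the feature vectors; withMean = false keeps sparse vectors sparse
val scaler = new StandardScaler(withMean = false, withStd = true).fit(data.map(_.features))

// Apply the same scaling to every point before training
val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(scaled)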



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-Logistic-Regression-performance-relative-to-Mahout-tp26346p26358.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Integration Patterns

2016-02-28 Thread Yashwanth Kumar
Hi, 
To connect to Spark from a remote location and submit jobs, you can try
Spark Job Server. It has been open sourced now.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Integration-Patterns-tp26354p26357.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkML Using Pipeline API locally on driver

2016-02-28 Thread Yanbo Liang
Hi Jean,

A DataFrame is tied to a SQLContext, which is tied to a SparkContext, so I
think it's impossible to run `model.transform` without touching Spark.
What you need is for the model to support prediction on a single instance;
then you can make predictions without Spark. You can track the progress of
https://issues.apache.org/jira/browse/SPARK-10413.

Thanks
Yanbo

2016-02-27 8:52 GMT+08:00 Eugene Morozov :

> Hi everyone.
>
> I have a requirement to run prediction for random forest model locally on
> a web-service without touching spark at all in some specific cases. I've
> achieved that with previous mllib API (java 8 syntax):
>
> public List<Tuple2<Double, Double>> predictLocally(RandomForestModel
> model, List<LabeledPoint> data) {
> return data.stream()
> .map(point -> new
> Tuple2<>(model.predict(point.features()), point.label()))
> .collect(Collectors.toList());
> }
>
> So I have instance of trained model and can use it any way I want.
> The question is whether it's possible to run this on the driver itself
> with the following:
> DataFrame predictions = model.transform(test);
> because AFAIU test has to be a DataFrame, which means it's going to be run
> on the cluster.
>
> The use case to run it on driver is very small amount of data for
> prediction - much faster to handle it this way, than using spark cluster.
> Thank you.
> --
> Be well!
> Jean Morozov
>


Re: a basic question on first use of PySpark shell and example, which is failing

2016-02-28 Thread Jules Damji

Hello Ronald,

Since you have placed the file under HDFS, you might want to change the path
name to:

val lines = sc.textFile("hdfs://user/taylor/Spark/Warehouse.java")
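
As a side note, a fully-qualified HDFS URI also needs the NameNode address, or
you can rely on the cluster's default filesystem with a plain absolute path. A
minimal sketch (the host and port below are placeholders):

// Fully-qualified URI; replace namenode-host:8020 with your cluster's NameNode
val lines = sc.textFile("hdfs://namenode-host:8020/user/rtaylor/Spark/Warehouse.java")

// Or rely on the default filesystem from the Hadoop configuration
val lines = sc.textFile("/user/rtaylor/Spark/Warehouse.java")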

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Feb 28, 2016, at 9:36 PM, Taylor, Ronald C  wrote:
> 
> 
> Hello folks,
> 
> I am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster at 
> our lab. I am trying to use the PySpark shell for the first time, and am 
> attempting to duplicate the documentation example of creating an RDD, which 
> I called "lines", using a text file.
> 
> I placed a text file called Warehouse.java in this HDFS location:
> 
> [rtaylor@bigdatann ~]$ hadoop fs -ls /user/rtaylor/Spark
> -rw-r--r--   3 rtaylor supergroup1155355 2016-02-28 18:09 
> /user/rtaylor/Spark/Warehouse.java
> [rtaylor@bigdatann ~]$ 
> 
> I then invoked sc.textFile() in the PySpark shell. That did not work. See 
> below. Apparently a class is not found? I don't know why that would be the 
> case. Any guidance would be very much appreciated.
> 
> The Cloudera Manager for the cluster says that Spark is operating  in the 
> "green", for whatever that is worth.
> 
>  - Ron Taylor
> 
> >>> lines = sc.textFile("file:///user/taylor/Spark/Warehouse.java")
> 
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
>  line 451, in textFile
> return RDD(self._jsc.textFile(name, minPartitions), self,
>   File 
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
>  line 36, in deco
> return f(*a, **kw)
>   File 
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
> : java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.spark.rdd.RDDOperationScope$
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
> at org.apache.spark.SparkContext.textFile(SparkContext.scala:825)
> at 
> org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:191)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> 
> >>> 


Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions

2016-02-28 Thread Hossein Vatani
Hi,
Affects Version/s:1.6.0
Component/s:PySpark

I faced the exception below when I tried to run
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=filter#pyspark.sql.SQLContext.jsonRDD
samples:
Exception: Python in worker has different version 2.7 than that in driver
3.5, PySpark cannot run with different minor versions
My OS is CentOS 7 and I installed Anaconda3; I also have to keep Python 2.7
for another application. I run Spark with:
PYSPARK_DRIVER_PYTHON=ipython3 pyspark
I do not have any config regarding "python" or "ipython" in my profile or
spark-defaults.conf.
Could you please assist me?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Python-in-worker-has-different-version-2-7-than-that-in-driver-3-5-PySpark-cannot-run-with-differents-tp26356.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



a basic question on first use of PySpark shell and example, which is failing

2016-02-28 Thread Taylor, Ronald C

Hello folks,

I am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster at 
our lab. I am trying to use the PySpark shell for the first time, and am 
attempting to duplicate the documentation example of creating an RDD, which I 
called "lines", using a text file.

I placed a text file called Warehouse.java in this HDFS location:

[rtaylor@bigdatann ~]$ hadoop fs -ls /user/rtaylor/Spark
-rw-r--r--   3 rtaylor supergroup1155355 2016-02-28 18:09 
/user/rtaylor/Spark/Warehouse.java
[rtaylor@bigdatann ~]$

I then invoked sc.textFile() in the PySpark shell. That did not work. See below. 
Apparently a class is not found? I don't know why that would be the case. Any 
guidance would be very much appreciated.

The Cloudera Manager for the cluster says that Spark is operating  in the 
"green", for whatever that is worth.

 - Ron Taylor

>>> lines = sc.textFile("file:///user/taylor/Spark/Warehouse.java")

Traceback (most recent call last):
  File "", line 1, in 
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
 line 451, in textFile
return RDD(self._jsc.textFile(name, minPartitions), self,
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
 line 36, in deco
return f(*a, **kw)
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
: java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:825)
at 
org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:191)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

>>>


RE: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Mohammed Guller
Hi Ashok,

Another book recommendation (I am the author): “Big Data Analytics with Spark”

The first half of the book is specifically written for people just getting 
started with Big Data and Spark.

Mohammed
Author: Big Data Analytics with Spark

From: Suhaas Lang [mailto:suhaas.l...@gmail.com]
Sent: Sunday, February 28, 2016 6:21 PM
To: Jules Damji
Cc: Ashok Kumar; User
Subject: Re: Recommendation for a good book on Spark, beginner to moderate 
knowledge


Thanks, Jules!
On Feb 28, 2016 7:47 PM, "Jules Damji" 
> wrote:
Suhass,

When I referred to interactive shells, I was referring to the Scala & Python 
interactive language shells. Both Python & Scala come with their respective 
interactive shells. Just typing “python” or “scala” (assuming the installation 
bin directory is in your $PATH) will fire up the shell.

As for the “pyspark” and “spark-shell”, they both come with the Spark 
installation and are in $spark_install_dir/bin directory.

Have a go at them. Best way to learn the language.

Cheers
Jules

--
“Language is the palate from which we draw all colors of our life.”
Jules Damji
dmat...@comcast.net




On Feb 28, 2016, at 4:08 PM, Suhaas Lang 
> wrote:


Jules,

Could you please post links to these interactive shells for Python and Scala?
On Feb 28, 2016 5:32 PM, "Jules Damji" 
> wrote:
Hello Ashoka,

"Learning Spark," from O'Reilly, is certainly a good start, and all basic video 
tutorials from Spark Summit Training, "Spark Essentials", are excellent 
supplementary materials.

And the best (and most effective) way to teach yourself is really firing up the 
spark-shell or pyspark and doing it yourself—immersing yourself by trying all 
basic transformations and actions on RDDs, with contrived small data sets.

I've discovered that learning Scala & Python through their interactive shell, 
where feedback is immediate and response is quick, as the best learning 
experience.

Same is true for Scala or Python Notebooks interacting with a Spark, running in 
local or cluster mode.

Cheers,

Jules

Sent from my iPhone
Pardon the dumb thumb typos :)

On Feb 28, 2016, at 1:48 PM, Ashok Kumar 
> wrote:
  Hi Gurus,

Appreciate if you recommend me a good book on Spark or documentation for 
beginner to moderate knowledge

I very much like to skill myself on transformation and action methods.

FYI, I have already looked at examples on net. However, some of them not clear 
at least to me.

Warmest regards



Re: Saving and Loading Dataframes

2016-02-28 Thread Yanbo Liang
Hi Raj,

If you choose JSON as the storage format, Spark SQL will store VectorUDT as
an Array of Double.
So when you load it back into memory, it cannot be recognized as a Vector.
One workaround is storing the DataFrame in Parquet format; it will then be
loaded and recognized as expected.

df.write.format("parquet").mode("overwrite").save(output)
val data = sqlContext.read.format("parquet").load(output)


Thanks
Yanbo

2016-02-27 2:01 GMT+08:00 Raj Kumar :

> Thanks for the response Yanbo. Here is the source (it uses the
> sample_libsvm_data.txt file used in the
> mllib examples).
>
> -Raj
> — IOTest.scala -
>
> import org.apache.spark.{SparkConf,SparkContext}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.DataFrame
>
> object IOTest {
>   val InputFile = "/tmp/sample_libsvm_data.txt"
>   val OutputDir ="/tmp/out"
>
>   val sconf = new SparkConf().setAppName("test").setMaster("local[*]")
>   val sqlc  = new SQLContext( new SparkContext( sconf ))
>   val df = sqlc.read.format("libsvm").load( InputFile  )
>   df.show; df.printSchema
>
>   df.write.format("json").mode("overwrite").save( OutputDir )
>   val data = sqlc.read.format("json").load( OutputDir )
>   data.show; data.printSchema
>
>   def main( args: Array[String]):Unit = {}
> }
>
>
> ---
>
> On Feb 26, 2016, at 12:47 AM, Yanbo Liang  wrote:
>
> Hi Raj,
>
> Could you share your code which can help others to diagnose this issue?
> Which version did you use?
> I can not reproduce this problem in my environment.
>
> Thanks
> Yanbo
>
> 2016-02-26 10:49 GMT+08:00 raj.kumar :
>
>> Hi,
>>
>> I am using mllib. I use the ml vectorization tools to create the
>> vectorized
>> input dataframe for
>> the ml/mllib machine-learning models with schema:
>>
>> root
>>  |-- label: double (nullable = true)
>>  |-- features: vector (nullable = true)
>>
>> To avoid repeated vectorization, I am trying to save and load this
>> dataframe
>> using
>>df.write.format("json").mode("overwrite").save( url )
>> val data = Spark.sqlc.read.format("json").load( url )
>>
>> However when I load the dataframe, the newly loaded dataframe has the
>> following schema:
>> root
>>  |-- features: struct (nullable = true)
>>  ||-- indices: array (nullable = true)
>>  |||-- element: long (containsNull = true)
>>  ||-- size: long (nullable = true)
>>  ||-- type: long (nullable = true)
>>  ||-- values: array (nullable = true)
>>  |||-- element: double (containsNull = true)
>>  |-- label: double (nullable = true)
>>
>> which the machine-learning models do not recognize.
>>
>> Is there a way I can save and load this dataframe without the schema
>> changing.
>> I assume it has to do with the fact that Vector is not a basic type.
>>
>> thanks
>> -Raj
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Saving-and-Loading-Dataframes-tp26339.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>> .
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Suhaas Lang
Thanks, Jules!
On Feb 28, 2016 7:47 PM, "Jules Damji"  wrote:

> Suhass,
>
> When I referred to interactive shells, I was referring to the Scala &
> Python interactive language shells. Both Python & Scala come with their
> respective interactive shells. Just typing “python” or “scala” (assuming
> the installation bin directory is in your $PATH) will fire up the
> shell.
>
> As for the “pyspark” and “spark-shell”, they both come with the Spark
> installation and are in $spark_install_dir/bin directory.
>
> Have a go at them. Best way to learn the language.
>
> Cheers
> Jules
>
> --
> “Language is the palate from which we draw all colors of our life.”
> Jules Damji
> dmat...@comcast.net
>
>
>
>
>
> On Feb 28, 2016, at 4:08 PM, Suhaas Lang  wrote:
>
> Jules,
>
> Could you please post links to these interactive shells for Python and
> Scala?
> On Feb 28, 2016 5:32 PM, "Jules Damji"  wrote:
>
>> Hello Ashoka,
>>
>> "Learning Spark," from O'Reilly, is certainly a good start, and all basic
>> video tutorials from Spark Summit Training, "Spark Essentials", are
>> excellent supplementary materials.
>>
>> And the best (and most effective) way to teach yourself is really firing
>> up the spark-shell or pyspark and doing it yourself—immersing yourself by
>> trying all basic transformations and actions on RDDs, with contrived small
>> data sets.
>>
>> I've discovered that learning Scala & Python through their interactive
>> shell, where feedback is immediate and response is quick, as the best
>> learning experience.
>>
>> Same is true for Scala or Python Notebooks interacting with a Spark,
>> running in local or cluster mode.
>>
>> Cheers,
>>
>> Jules
>>
>> Sent from my iPhone
>> Pardon the dumb thumb typos :)
>>
>> On Feb 28, 2016, at 1:48 PM, Ashok Kumar > > wrote:
>>
>>   Hi Gurus,
>>
>> Appreciate if you recommend me a good book on Spark or documentation for
>> beginner to moderate knowledge
>>
>> I very much like to skill myself on transformation and action methods.
>>
>> FYI, I have already looked at examples on net. However, some of them not
>> clear at least to me.
>>
>> Warmest regards
>>
>>
>


Error when trying to overwrite a partition dynamically in Spark SQL

2016-02-28 Thread SRK
Hi,

I am getting an error when trying to overwrite a partition dynamically.
Following is the code and the error. Any idea as to why this is happening?



test.write.partitionBy("dtPtn","idPtn").mode(SaveMode.Overwrite).format("parquet").save("/user/test/sessRecs")

16/02/28 18:02:55 ERROR input.FileInputFormat: Exception while trying to
createSplits
java.util.concurrent.ExecutionException:
java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:337)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-trying-to-overwrite-a-partition-dynamically-in-Spark-SQL-tp26355.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Question about MEMORY_AND_DISK persistence

2016-02-28 Thread Vishnu Viswanath
Thank you Ashwin.

On Sun, Feb 28, 2016 at 7:19 PM, Ashwin Giridharan 
wrote:

> Hi Vishnu,
>
> A partition will either be in memory or on disk.
>
> -Ashwin
> On Feb 28, 2016 15:09, "Vishnu Viswanath" 
> wrote:
>
>> Hi All,
>>
>> I have a question regarding Persistence (MEMORY_AND_DISK)
>>
>> Suppose I am trying to persist an RDD which has 2 partitions and only 1
>> partition can be fit in memory completely but some part of partition 2 can
>> also be fit, will spark keep the portion of partition 2 in memory and rest
>> in disk, or will the whole 2nd partition be kept in disk.
>>
>> Regards,
>> Vishnu
>>
>


-- 
Thanks and Regards,
Vishnu Viswanath,
*www.vishnuviswanath.com *


Re: Question about MEMORY_AND_DISK persistence

2016-02-28 Thread Ashwin Giridharan
Hi Vishnu,

A partition will either be in memory or on disk.
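
For illustration, a minimal sketch of requesting that storage level (the data
set is just a placeholder); blocks that don't fit in memory are spilled to disk
whole, one block per partition:

import org.apache.spark.storage.StorageLevel

// Placeholder RDD with two partitions, mirroring the question
val rdd = sc.parallelize(1 to 1000000, 2)

// Keep what fits in memory; spill whole partitions (blocks) to disk otherwise
rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count()  // the first action materializes and caches the partitions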

-Ashwin
On Feb 28, 2016 15:09, "Vishnu Viswanath" 
wrote:

> Hi All,
>
> I have a question regarding Persistence (MEMORY_AND_DISK)
>
> Suppose I am trying to persist an RDD which has 2 partitions and only 1
> partition can be fit in memory completely but some part of partition 2 can
> also be fit, will spark keep the portion of partition 2 in memory and rest
> in disk, or will the whole 2nd partition be kept in disk.
>
> Regards,
> Vishnu
>


Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Jules Damji
Suhass,

When I referred to interactive shells, I was referring to the Scala & Python 
interactive language shells. Both Python & Scala come with their respective 
interactive shells. Just typing “python” or “scala” (assuming the installation 
bin directory is in your $PATH) will fire up the shell.

As for the “pyspark” and “spark-shell”, they both come with the Spark 
installation and are in $spark_install_dir/bin directory.

Have a go at them. Best way to learn the language.

Cheers
Jules

--
“Language is the palate from which we draw all colors of our life.”
Jules Damji
dmat...@comcast.net





> On Feb 28, 2016, at 4:08 PM, Suhaas Lang  wrote:
> 
> Jules,
> 
> Could you please post links to these interactive shells for Python and Scala?
> 
> On Feb 28, 2016 5:32 PM, "Jules Damji"  > wrote:
> Hello Ashoka,
> 
> "Learning Spark," from O'Reilly, is certainly a good start, and all basic 
> video tutorials from Spark Summit Training, "Spark Essentials", are excellent 
> supplementary materials.
> 
> And the best (and most effective) way to teach yourself is really firing up 
> the spark-shell or pyspark and doing it yourself—immersing yourself by trying 
> all basic transformations and actions on RDDs, with contrived small data sets.
> 
> I've discovered that learning Scala & Python through their interactive shell, 
> where feedback is immediate and response is quick, as the best learning 
> experience. 
> 
> Same is true for Scala or Python Notebooks interacting with a Spark, running 
> in local or cluster mode. 
> 
> Cheers,
> 
> Jules 
> 
> Sent from my iPhone
> Pardon the dumb thumb typos :)
> 
> On Feb 28, 2016, at 1:48 PM, Ashok Kumar  > wrote:
> 
>>   Hi Gurus,
>> 
>> Appreciate if you recommend me a good book on Spark or documentation for 
>> beginner to moderate knowledge
>> 
>> I very much like to skill myself on transformation and action methods.
>> 
>> FYI, I have already looked at examples on net. However, some of them not 
>> clear at least to me.
>> 
>> Warmest regards



Pattern Matching over a Sequence of rows using Spark

2016-02-28 Thread Jerry Lam
Hi spark users and developers,

Anyone has experience developing pattern matching over a sequence of rows
using Spark? I'm talking about functionality similar to matchpath in Hive
or match_recognize in Oracle DB. It is used for path analysis on
clickstream data. If you know of any libraries that do that, please share
your findings!

Thank you,

Jerry


Re: Is spark.driver.maxResultSize used correctly ?

2016-02-28 Thread Jeff Zhang
Data skew is possible, but it is not the common case. I think we should
design for the common case; for the skew case, we could add some fraction
parameter to allow the user to tune it.

On Sat, Feb 27, 2016 at 4:51 PM, Reynold Xin  wrote:

> But sometimes you might have skew and almost all the result data are in
> one or a few tasks though.
>
>
> On Friday, February 26, 2016, Jeff Zhang  wrote:
>
>>
>> My job gets this exception very easily even when I set a large value of
>> spark.driver.maxResultSize. After checking the Spark code, I found that
>> spark.driver.maxResultSize is also used on the executor side to decide whether
>> a DirectTaskResult or an IndirectTaskResult is sent. This doesn't make sense to me.
>> Using spark.driver.maxResultSize / taskNum might be more proper, because
>> if spark.driver.maxResultSize is 1g and we have 10 tasks, each with 200m of
>> output, then even though the output of each task is less than
>> spark.driver.maxResultSize (so a DirectTaskResult will be sent to the driver),
>> the total result size is 2g, which will cause an exception on the driver side.
>>
>>
>> 16/02/26 10:10:49 INFO DAGScheduler: Job 4 failed: treeAggregate at
>> LogisticRegression.scala:283, took 33.796379 s
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Total size of serialized results of 1 tasks (1085.0
>> MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>


-- 
Best Regards

Jeff Zhang


Re: PySpark : couldn't pickle object of type class T

2016-02-28 Thread Jeff Zhang
Hi Anoop,

I don't see the exception you mentioned in the link. I can use spark-avro
to read the sample file users.avro in spark successfully. Do you have the
details of the union issue ?



On Sat, Feb 27, 2016 at 10:05 AM, Anoop Shiralige  wrote:

> Hi Jeff,
>
> Thank you for looking into the post.
>
> I had explored the spark-avro option earlier. Since we have a union of multiple
> complex data types in our Avro schema, we couldn't use it.
> A couple of things I tried:
>
>-
>
> https://stackoverflow.com/questions/31261376/how-to-read-pyspark-avro-file-and-extract-the-values
>  :
>"Spark Exception : Unions may only consist of concrete type and null"
>- Use of dataFrame/DataSet : serialization problem.
>
> For now, I got it working by modifying AvroConversionUtils to address the
> union of multiple data types.
>
> Thanks,
> AnoopShiralige
>
>
> On Thu, Feb 25, 2016 at 7:25 AM, Jeff Zhang  wrote:
>
>> Avro Record is not supported by the pickler; you need to create a custom
>> pickler for it. But I don't think it's worth doing that. Actually, you can
>> use the spark-avro package to load Avro data and then convert it to an RDD if
>> necessary.
>>
>> https://github.com/databricks/spark-avro
>>
>>
>> On Thu, Feb 11, 2016 at 10:38 PM, Anoop Shiralige <
>> anoop.shiral...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am working with Spark 1.6.0 and pySpark shell specifically. I have an
>>> JavaRDD[org.apache.avro.GenericRecord] which I have converted to
>>> pythonRDD
>>> in the following way.
>>>
>>> javaRDD = sc._jvm.java.package.loadJson("path to data", sc._jsc)
>>> javaPython = sc._jvm.SerDe.javaToPython(javaRDD)
>>> from pyspark.rdd import RDD
>>> pythonRDD=RDD(javaPython,sc)
>>>
>>> pythonRDD.first()
>>>
>>> However everytime I am trying to call collect() or first() method on
>>> pythonRDD I am getting the following error :
>>>
>>> 16/02/11 06:19:19 ERROR python.PythonRunner: Python worker exited
>>> unexpectedly (crashed)
>>> org.apache.spark.api.python.PythonException: Traceback (most recent call
>>> last):
>>>   File
>>>
>>> "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/worker.py",
>>> line 98, in main
>>> command = pickleSer._read_with_length(infile)
>>>   File
>>>
>>> "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py",
>>> line 156, in _read_with_length
>>> length = read_int(stream)
>>>   File
>>>
>>> "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py",
>>> line 545, in read_int
>>> raise EOFError
>>> EOFError
>>>
>>> at
>>>
>>> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>>> at
>>>
>>> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>>> at
>>> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>>> at
>>> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:744)
>>> Caused by: net.razorvine.pickle.PickleException: couldn't pickle object
>>> of
>>> type class org.apache.avro.generic.GenericData$Record
>>> at net.razorvine.pickle.Pickler.save(Pickler.java:142)
>>> at
>>> net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:493)
>>> at net.razorvine.pickle.Pickler.dispatch(Pickler.java:205)
>>> at net.razorvine.pickle.Pickler.save(Pickler.java:137)
>>> at net.razorvine.pickle.Pickler.dump(Pickler.java:107)
>>> at net.razorvine.pickle.Pickler.dumps(Pickler.java:92)
>>> at
>>>
>>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:121)
>>> at
>>>
>>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:110)
>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> at
>>>
>>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:110)
>>> at
>>>
>>> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
>>> at
>>>
>>> org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
>>> at
>>> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
>>> at
>>>
>>> 

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Suhaas Lang
Jules,

Could you please post links to these interactive shells for Python and
Scala?
On Feb 28, 2016 5:32 PM, "Jules Damji"  wrote:

> Hello Ashoka,
>
> "Learning Spark," from O'Reilly, is certainly a good start, and all basic
> video tutorials from Spark Summit Training, "Spark Essentials", are
> excellent supplementary materials.
>
> And the best (and most effective) way to teach yourself is really firing
> up the spark-shell or pyspark and doing it yourself—immersing yourself by
> trying all basic transformations and actions on RDDs, with contrived small
> data sets.
>
> I've discovered that learning Scala & Python through their interactive
> shell, where feedback is immediate and response is quick, as the best
> learning experience.
>
> Same is true for Scala or Python Notebooks interacting with a Spark,
> running in local or cluster mode.
>
> Cheers,
>
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Feb 28, 2016, at 1:48 PM, Ashok Kumar  > wrote:
>
>   Hi Gurus,
>
> Appreciate if you recommend me a good book on Spark or documentation for
> beginner to moderate knowledge
>
> I very much like to skill myself on transformation and action methods.
>
> FYI, I have already looked at examples on net. However, some of them not
> clear at least to me.
>
> Warmest regards
>
>


Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Chris Fregly
For hands-on work, check out the end-to-end reference data pipeline available
either from the GitHub or Docker repos described here:

http://advancedspark.com/

I use these assets to train folks of all levels of Spark knowledge.

There are also some relevant videos and SlideShare presentations, but they
might be a bit advanced for Spark n00bs.


On Sun, Feb 28, 2016 at 4:25 PM, Mich Talebzadeh 
wrote:

> In my opinion the best way to learn something is trying it on the spot.
>
> As suggested if you have Hadoop, Hive and Spark installed and you are OK
> with SQL then you will have to focus on Scala and Spark pretty much.
>
> Your best bet is interactive work through Spark shell with Scala,
> understanding RDD, DataFrame, Transformation and actions. You also have
> online docs and a great number of users in this forum that can potentially
> help you with your questions. Buying books can help but nothing takes the
> place of getting your hands dirty so to speak.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 February 2016 at 22:32, Jules Damji  wrote:
>
>> Hello Ashoka,
>>
>> "Learning Spark," from O'Reilly, is certainly a good start, and all basic
>> video tutorials from Spark Summit Training, "Spark Essentials", are
>> excellent supplementary materials.
>>
>> And the best (and most effective) way to teach yourself is really firing
>> up the spark-shell or pyspark and doing it yourself—immersing yourself by
>> trying all basic transformations and actions on RDDs, with contrived small
>> data sets.
>>
>> I've discovered that learning Scala & Python through their interactive
>> shell, where feedback is immediate and response is quick, as the best
>> learning experience.
>>
>> Same is true for Scala or Python Notebooks interacting with a Spark,
>> running in local or cluster mode.
>>
>> Cheers,
>>
>> Jules
>>
>> Sent from my iPhone
>> Pardon the dumb thumb typos :)
>>
>> On Feb 28, 2016, at 1:48 PM, Ashok Kumar > > wrote:
>>
>>   Hi Gurus,
>>
>> Appreciate if you recommend me a good book on Spark or documentation for
>> beginner to moderate knowledge
>>
>> I very much like to skill myself on transformation and action methods.
>>
>> FYI, I have already looked at examples on net. However, some of them not
>> clear at least to me.
>>
>> Warmest regards
>>
>>
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com


Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Mich Talebzadeh
In my opinion, the best way to learn something is to try it on the spot.

As suggested, if you have Hadoop, Hive and Spark installed and you are OK
with SQL, then you will have to focus on Scala and Spark pretty much.

Your best bet is interactive work through the Spark shell with Scala,
understanding RDDs, DataFrames, transformations and actions. You also have
the online docs and a great number of users in this forum who can potentially
help you with your questions. Buying books can help, but nothing takes the
place of getting your hands dirty, so to speak.
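
For instance, a first spark-shell session could look like this (a minimal
sketch; the numbers are arbitrary):

// A transformation (map) is lazy; an action (reduce) actually runs the job
val nums    = sc.parallelize(1 to 10)      // RDD[Int]
val squares = nums.map(n => n * n)         // transformation: nothing runs yet
val total   = squares.reduce(_ + _)        // action: returns 385

// Rough DataFrame equivalent via the sqlContext available in the shell
val df = sqlContext.range(1, 11)
df.selectExpr("id * id as square").show()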

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 28 February 2016 at 22:32, Jules Damji  wrote:

> Hello Ashoka,
>
> "Learning Spark," from O'Reilly, is certainly a good start, and all basic
> video tutorials from Spark Summit Training, "Spark Essentials", are
> excellent supplementary materials.
>
> And the best (and most effective) way to teach yourself is really firing
> up the spark-shell or pyspark and doing it yourself—immersing yourself by
> trying all basic transformations and actions on RDDs, with contrived small
> data sets.
>
> I've discovered that learning Scala & Python through their interactive
> shell, where feedback is immediate and response is quick, as the best
> learning experience.
>
> Same is true for Scala or Python Notebooks interacting with a Spark,
> running in local or cluster mode.
>
> Cheers,
>
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Feb 28, 2016, at 1:48 PM, Ashok Kumar  > wrote:
>
>   Hi Gurus,
>
> Appreciate if you recommend me a good book on Spark or documentation for
> beginner to moderate knowledge
>
> I very much like to skill myself on transformation and action methods.
>
> FYI, I have already looked at examples on net. However, some of them not
> clear at least to me.
>
> Warmest regards
>
>


Question about MEMORY_AND_DISK persistence

2016-02-28 Thread Vishnu Viswanath
Hi All,

I have a question regarding Persistence (MEMORY_AND_DISK)

Suppose I am trying to persist an RDD which has 2 partitions and only 1
partition can be fit in memory completely but some part of partition 2 can
also be fit, will spark keep the portion of partition 2 in memory and rest
in disk, or will the whole 2nd partition be kept in disk.

Regards,
Vishnu


Re: Spark Integration Patterns

2016-02-28 Thread ayan guha
I believe you are looking for something like Spark Jobserver for running
jobs & the JDBC server for accessing data? I am curious to know more about it;
any further discussion will be very helpful.

On Mon, Feb 29, 2016 at 6:06 AM, Luciano Resende 
wrote:

> One option we have used in the past is to expose spark application
> functionality via REST, this would enable python or any other client that
> is capable of doing a HTTP request to integrate with your Spark application.
>
> To get you started, this might be a useful reference
>
>
> http://blog.michaelhamrah.com/2013/06/scala-web-apis-up-and-running-with-spray-and-akka/
>
>
> On Sun, Feb 28, 2016 at 10:38 AM, moshir mikael 
> wrote:
>
>> Ok,
>> but what do I need for the program to run.
>> In python  sparkcontext  = SparkContext(conf) only works when you have
>> spark installed locally.
>> AFAIK there is no *pyspark *package for python that you can install
>> doing pip install pyspark.
>> You actually need to install spark to get it running (e.g :
>> https://github.com/KristianHolsheimer/pyspark-setup-guide).
>>
>> Does it mean you need to install spark on the box your applications runs
>> to benefit from pyspark and this is required to connect to another remote
>> spark cluster ?
>> Am I missing something obvious ?
>>
>>
>> Le dim. 28 févr. 2016 à 19:01, Todd Nist  a écrit :
>>
>>> Define your SparkConfig to set the master:
>>>
>>>   val conf = new SparkConf().setAppName(AppName)
>>> .setMaster(SparkMaster)
>>> .set()
>>>
>>> Where SparkMaster = "spark://SparkServerHost:7077".  So if your spark
>>> server hostname it "RADTech" then it would be "spark://RADTech:7077".
>>>
>>> Then when you create the SparkContext, pass the SparkConf  to it:
>>>
>>> val sparkContext = new SparkContext(conf)
>>>
>>> Then use the sparkContext for interact with the SparkMaster / Cluster.
>>> Your program basically becomes the driver.
>>>
>>> HTH.
>>>
>>> -Todd
>>>
>>> On Sun, Feb 28, 2016 at 9:25 AM, mms  wrote:
>>>
 Hi, I cannot find a simple example showing how a typical application
 can 'connect' to a remote spark cluster and interact with it. Let's say I
 have a Python web application hosted somewhere *outside *a spark
 cluster, with just python installed on it. How can I talk to Spark without
 using a notebook, or using ssh to connect to a cluster master node ? I know
 of spark-submit and spark-shell, however forking a process on a remote host
 to execute a shell script seems like a lot of effort What are the
 recommended ways to connect and query Spark from a remote client ? Thanks
 Thx !
 --
 View this message in context: Spark Integration Patterns
 
 Sent from the Apache Spark User List mailing list archive
  at Nabble.com.

>>>
>>>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>



-- 
Best Regards,
Ayan Guha


Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Jules Damji
Hello Ashoka,

"Learning Spark," from O'Reilly, is certainly a good start, and all basic video 
tutorials from Spark Summit Training, "Spark Essentials", are excellent 
supplementary materials.

And the best (and most effective) way to teach yourself is really firing up the 
spark-shell or pyspark and doing it yourself—immersing yourself by trying all 
basic transformations and actions on RDDs, with contrived small data sets.

I've discovered that learning Scala & Python through their interactive shells, 
where feedback is immediate and response is quick, is the best learning 
experience.

The same is true for Scala or Python notebooks interacting with Spark, running in 
local or cluster mode.

Cheers,

Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Feb 28, 2016, at 1:48 PM, Ashok Kumar  wrote:
> 
>   Hi Gurus,
> 
> Appreciate if you recommend me a good book on Spark or documentation for 
> beginner to moderate knowledge
> 
> I very much like to skill myself on transformation and action methods.
> 
> FYI, I have already looked at examples on net. However, some of them not 
> clear at least to me.
> 
> Warmest regards


Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Nicos
I agree with the suggestion to start with "Learning Spark" to further forge your 
knowledge of Spark fundamentals.

"Advanced Analytics with Spark" has good practical reinforcement of what you 
learn from the previous book. Though it is a bit advanced, in my opinion some 
practical/real applications are better covered in this book.

For DataFrames and the newer APIs, the online Apache Spark documentation is 
still the best source.

Keep in mind Spark and its different subsystems are constantly evolving. 
Publications will be always somewhat outdated but not the key fundamental 
concepts.

Cheers,
- Nicos
+++ 


> On Feb 28, 2016, at 1:53 PM, Michał Zieliński  
> wrote:
> 
> Most of the books are outdated (don't include DataFrames or Spark ML and 
> focus on RDDs and MLlib). The one I particularly liked is "Learning Spark". 
> It starts from the basics, but has lots of useful tips on caching, 
> serialization etc.
> 
> The online docs are also of great quality.
> 
>> On 28 February 2016 at 21:48, Ashok Kumar  
>> wrote:
>>   Hi Gurus,
>> 
>> Appreciate if you recommend me a good book on Spark or documentation for 
>> beginner to moderate knowledge
>> 
>> I very much like to skill myself on transformation and action methods.
>> 
>> FYI, I have already looked at examples on net. However, some of them not 
>> clear at least to me.
>> 
>> Warmest regards
> 


Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-28 Thread Shixiong(Ryan) Zhu
This is because the Snappy library cannot load the native library. Did you
forget to install the snappy native library in your new machines?
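
For reference, the codec switch mentioned below is controlled by
spark.io.compression.codec; a minimal sketch of setting it when building the
configuration (the application name is a placeholder):

import org.apache.spark.SparkConf

// Work around a missing snappy native library by switching the block/broadcast codec to lzf
val conf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.io.compression.codec", "lzf")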

On Fri, Feb 26, 2016 at 11:05 PM, Abhishek Anand 
wrote:

> Any insights on this ?
>
> On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand 
> wrote:
>
>> On changing the default compression codec which is snappy to lzf the
>> errors are gone !!
>>
>> How can I fix this using snappy as the codec ?
>>
>> Is there any downside of using lzf as snappy is the default codec that
>> ships with spark.
>>
>>
>> Thanks !!!
>> Abhi
>>
>> On Mon, Feb 22, 2016 at 7:42 PM, Abhishek Anand 
>> wrote:
>>
>>> Hi ,
>>>
>>> I am getting the following exception on running my spark streaming job.
>>>
>>> The same job has been running fine since long and when I added two new
>>> machines to my cluster I see the job failing with the following exception.
>>>
>>>
>>>
>>> 16/02/22 19:23:01 ERROR Executor: Exception in task 2.0 in stage 4229.0
>>> (TID 22594)
>>> java.io.IOException: java.lang.reflect.InvocationTargetException
>>> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1257)
>>> at
>>> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
>>> at
>>> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>>> at
>>> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>>> at
>>> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
>>> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:59)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:744)
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
>>> at
>>> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
>>> at
>>> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
>>> at org.apache.spark.broadcast.TorrentBroadcast.org
>>> $apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
>>> at
>>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:167)
>>> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1254)
>>> ... 11 more
>>> Caused by: java.lang.IllegalArgumentException
>>> at
>>> org.apache.spark.io.SnappyCompressionCodec.(CompressionCodec.scala:152)
>>> ... 20 more
>>>
>>>
>>>
>>> Thanks !!!
>>> Abhi
>>>
>>
>>
>


Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-28 Thread Shixiong(Ryan) Zhu
Sorry that I forgot to tell you that you should also call `rdd.count()` for
"reduceByKey" as well. Could you try it and see if it works?

On Sat, Feb 27, 2016 at 1:17 PM, Abhishek Anand 
wrote:

> Hi Ryan,
>
> I am using mapWithState after doing reduceByKey.
>
> I am right now using mapWithState as you suggested and triggering the
> count manually.
>
> But, still unable to see any checkpointing taking place. In the DAG I can
> see that the reduceByKey operation for the previous batches are also being
> computed.
>
>
> Thanks
> Abhi
>
>
> On Tue, Feb 23, 2016 at 2:36 AM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Hey Abhi,
>>
>> Using reducebykeyandwindow and mapWithState will trigger the bug
>> in SPARK-6847. Here is a workaround to trigger checkpoint manually:
>>
>> JavaMapWithStateDStream<...> stateDStream =
>> myPairDstream.mapWithState(StateSpec.function(mappingFunc));
>> stateDStream.foreachRDD(new Function1<...>() {
>>   @Override
>>   public Void call(JavaRDD<...> rdd) throws Exception {
>> rdd.count();
>>   }
>> });
>> return stateDStream.stateSnapshots();
>>
>>
>> On Mon, Feb 22, 2016 at 12:25 PM, Abhishek Anand > > wrote:
>>
>>> Hi Ryan,
>>>
>>> Reposting the code.
>>>
>>> Basically my use case is something like - I am receiving the web
>>> impression logs and may get the notify (listening from kafka) for those
>>> impressions in the same interval (for me its 1 min) or any next interval
>>> (upto 2 hours). Now, when I receive notify for a particular impression I
>>> need to swap the date field in impression with the date field in notify
>>> logs. The notify for an impression has the same key as impression.
>>>
>>> static Function3<String, Optional<MyClass>, State<MyClass>,
>>> Tuple2<String, MyClass>> mappingFunc =
>>> new Function3<String, Optional<MyClass>, State<MyClass>,
>>> Tuple2<String, MyClass>>() {
>>> @Override
>>> public Tuple2<String, MyClass> call(String key, Optional<MyClass> one,
>>> State<MyClass> state) {
>>> MyClass nullObj = new MyClass();
>>> nullObj.setImprLog(null);
>>> nullObj.setNotifyLog(null);
>>> MyClass current = one.or(nullObj);
>>>
>>> if(current!= null && current.getImprLog() != null &&
>>> current.getMyClassType() == 1 /*this is impression*/){
>>> return new Tuple2<>(key, null);
>>> }
>>> else if (current.getNotifyLog() != null  && current.getMyClassType() ==
>>> 3 /*notify for the impression received*/){
>>> MyClass oldState = (state.exists() ? state.get() : nullObj);
>>> if(oldState!= null && oldState.getNotifyLog() != null){
>>> oldState.getImprLog().setCrtd(current.getNotifyLog().getCrtd());
>>>  //swappping the dates
>>> return new Tuple2<>(key, oldState);
>>> }
>>> else{
>>> return new Tuple2<>(key, null);
>>> }
>>> }
>>> else{
>>> return new Tuple2<>(key, null);
>>> }
>>>
>>> }
>>> };
>>>
>>>
>>> return
>>> myPairDstream.mapWithState(StateSpec.function(mappingFunc)).stateSnapshots();
>>>
>>>
>>> Currently I am using reducebykeyandwindow without the inverse function
>>> and I am able to get the correct data. But, issue the might arise is when I
>>> have to restart my application from checkpoint and it repartitions and
>>> computes the previous 120 partitions, which delays the incoming batches.
>>>
>>>
>>> Thanks !!
>>> Abhi
>>>
>>> On Tue, Feb 23, 2016 at 1:25 AM, Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
 Hey Abhi,

 Could you post how you use mapWithState? By default, it should do
 checkpointing every 10 batches.
 However, there is a known issue that prevents mapWithState from
 checkpointing in some special cases:
 https://issues.apache.org/jira/browse/SPARK-6847

 On Mon, Feb 22, 2016 at 5:47 AM, Abhishek Anand <
 abhis.anan...@gmail.com> wrote:

> Any Insights on this one ?
>
>
> Thanks !!!
> Abhi
>
> On Mon, Feb 15, 2016 at 11:08 PM, Abhishek Anand <
> abhis.anan...@gmail.com> wrote:
>
>> I am now trying to use mapWithState in the following way using some
>> example codes. But, by looking at the DAG it does not seem to checkpoint
>> the state and when restarting the application from checkpoint, it
>> re-partitions all the previous batches data from kafka.
>>
>> static Function3<String, Optional<MyClass>, State<MyClass>,
>> Tuple2<String, MyClass>> mappingFunc =
>> new Function3<String, Optional<MyClass>, State<MyClass>,
>> Tuple2<String, MyClass>>() {
>> @Override
>> public Tuple2<String, MyClass> call(String key, Optional<MyClass>
>> one, State<MyClass> state) {
>> MyClass nullObj = new MyClass();
>> nullObj.setImprLog(null);
>> nullObj.setNotifyLog(null);
>> MyClass current = one.or(nullObj);
>>
>> if(current!= null && current.getImprLog() != null &&
>> current.getMyClassType() == 1){
>> return new Tuple2<>(key, null);
>> }
>> else if (current.getNotifyLog() != null  && current.getMyClassType()
>> == 3){
>> MyClass oldState = (state.exists() ? state.get() : 

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Ted Yu
http://www.amazon.com/Scala-Spark-Alexy-Khrabrov/dp/1491929286/ref=sr_1_1?ie=UTF8=1456696284=8-1=spark+dataframe

There is another one from Wiley (to be published on March 21):

"Spark: Big Data Cluster Computing in Production," written by Ilya Ganelin,
Brennon York, Kai Sasaki, and Ema Orhian

On Sun, Feb 28, 2016 at 1:48 PM, Ashok Kumar 
wrote:

>   Hi Gurus,
>
> Appreciate if you recommend me a good book on Spark or documentation for
> beginner to moderate knowledge
>
> I very much like to skill myself on transformation and action methods.
>
> FYI, I have already looked at examples on net. However, some of them not
> clear at least to me.
>
> Warmest regards
>


Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Ashok Kumar
  Hi Gurus,
Appreciate if you recommend me a good book on Spark or documentation for 
beginner to moderate knowledge
I very much like to skill myself on transformation and action methods.
FYI, I have already looked at examples on net. However, some of them not clear 
at least to me.
Warmest regards

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-28 Thread Koert Kuipers
I find it particularly confusing that a new memory management module would
change the locations. It's not like the hash partitioner got replaced. I can
switch back and forth between legacy and "new" memory management and see
the distribution change... fully reproducible.

On Sun, Feb 28, 2016 at 11:24 AM, Lior Chaga  wrote:

> Hi,
> I've experienced a similar problem upgrading from spark 1.4 to spark 1.6.
> The data is not evenly distributed across executors, but in my case it
> also reproduced with legacy mode.
> Also tried 1.6.1 rc-1, with same results.
>
> Still looking for resolution.
>
> Lior
>
> On Fri, Feb 19, 2016 at 2:01 AM, Koert Kuipers  wrote:
>
>> looking at the cached rdd i see a similar story:
>> with useLegacyMode = true the cached rdd is spread out across 10
>> executors, but with useLegacyMode = false the data for the cached rdd sits
>> on only 3 executors (the rest all show 0s). my cached RDD is a key-value
>> RDD that got partitioned (hash partitioner, 50 partitions) before being
>> cached.
>>
>> On Thu, Feb 18, 2016 at 6:51 PM, Koert Kuipers  wrote:
>>
>>> hello all,
>>> we are just testing a semi-realtime application (it should return
>>> results in less than 20 seconds from cached RDDs) on spark 1.6.0. before
>>> this it used to run on spark 1.5.1
>>>
>>> in spark 1.6.0 the performance is similar to 1.5.1 if i set
>>> spark.memory.useLegacyMode = true, however if i switch to
>>> spark.memory.useLegacyMode = false the queries take about 50% to 100% more
>>> time.
>>>
>>> the issue becomes clear when i focus on a single stage: the individual
>>> tasks are not slower at all, but they run on less executors.
>>> in my test query i have 50 tasks and 10 executors. both with
>>> useLegacyMode = true and useLegacyMode = false the tasks finish in about 3
>>> seconds and show as running PROCESS_LOCAL. however when  useLegacyMode =
>>> false the tasks run on just 3 executors out of 10, while with useLegacyMode
>>> = true they spread out across 10 executors. all the tasks running on just a
>>> few executors leads to the slower results.
>>>
>>> any idea why this would happen?
>>> thanks! koert
>>>
>>>
>>>
>>
>


Re: Spark Integration Patterns

2016-02-28 Thread Luciano Resende
One option we have used in the past is to expose spark application
functionality via REST, this would enable python or any other client that
is capable of doing a HTTP request to integrate with your Spark application.

To get you started, this might be a useful reference

http://blog.michaelhamrah.com/2013/06/scala-web-apis-up-and-running-with-spray-and-akka/
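
Not Spark-specific, but as a bare-bones illustration of the idea: a driver
program that holds a SparkContext and exposes one HTTP endpoint using only the
JDK's built-in server (the master URL, port and endpoint names are
placeholders; a real service would use Spray/Akka, as in the link above):

import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress
import org.apache.spark.{SparkConf, SparkContext}

object RestSparkApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rest-example").setMaster("spark://master-host:7077"))

    val server = HttpServer.create(new InetSocketAddress(8090), 0)
    server.createContext("/count", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // Placeholder job: any Spark work triggered here runs on the cluster
        val n = sc.parallelize(1 to 1000).count()
        val body = s"""{"count": $n}""".getBytes("UTF-8")
        exchange.sendResponseHeaders(200, body.length)
        exchange.getResponseBody.write(body)
        exchange.getResponseBody.close()
      }
    })
    server.start()
  }
}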


On Sun, Feb 28, 2016 at 10:38 AM, moshir mikael 
wrote:

> Ok,
> but what do I need for the program to run.
> In python  sparkcontext  = SparkContext(conf) only works when you have
> spark installed locally.
> AFAIK there is no *pyspark *package for python that you can install doing
> pip install pyspark.
> You actually need to install spark to get it running (e.g :
> https://github.com/KristianHolsheimer/pyspark-setup-guide).
>
> Does it mean you need to install spark on the box your applications runs
> to benefit from pyspark and this is required to connect to another remote
> spark cluster ?
> Am I missing something obvious ?
>
>
> Le dim. 28 févr. 2016 à 19:01, Todd Nist  a écrit :
>
>> Define your SparkConfig to set the master:
>>
>>   val conf = new SparkConf().setAppName(AppName)
>> .setMaster(SparkMaster)
>> .set()
>>
>> Where SparkMaster = "spark://SparkServerHost:7077".  So if your spark
>> server hostname it "RADTech" then it would be "spark://RADTech:7077".
>>
>> Then when you create the SparkContext, pass the SparkConf  to it:
>>
>> val sparkContext = new SparkContext(conf)
>>
>> Then use the sparkContext for interact with the SparkMaster / Cluster.
>> Your program basically becomes the driver.
>>
>> HTH.
>>
>> -Todd
>>
>> On Sun, Feb 28, 2016 at 9:25 AM, mms  wrote:
>>
>>> Hi, I cannot find a simple example showing how a typical application can
>>> 'connect' to a remote spark cluster and interact with it. Let's say I have
>>> a Python web application hosted somewhere *outside *a spark cluster,
>>> with just python installed on it. How can I talk to Spark without using a
>>> notebook, or using ssh to connect to a cluster master node ? I know of
>>> spark-submit and spark-shell, however forking a process on a remote host to
>>> execute a shell script seems like a lot of effort What are the recommended
>>> ways to connect and query Spark from a remote client ? Thanks Thx !
>>> --
>>> View this message in context: Spark Integration Patterns
>>> 
>>> Sent from the Apache Spark User List mailing list archive
>>>  at Nabble.com.
>>>
>>
>>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Spark Integration Patterns

2016-02-28 Thread Todd Nist
I'm not sure about Python; I'm not an expert in that area. Based on PR
https://github.com/apache/spark/pull/8318, I believe you are correct that
Spark would need to be installed for you to be able to currently leverage
the pyspark package.

On Sun, Feb 28, 2016 at 1:38 PM, moshir mikael 
wrote:

> Ok,
> but what do I need for the program to run.
> In python  sparkcontext  = SparkContext(conf) only works when you have
> spark installed locally.
> AFAIK there is no *pyspark *package for python that you can install doing
> pip install pyspark.
> You actually need to install spark to get it running (e.g :
> https://github.com/KristianHolsheimer/pyspark-setup-guide).
>
> Does it mean you need to install spark on the box your applications runs
> to benefit from pyspark and this is required to connect to another remote
> spark cluster ?
> Am I missing something obvious ?
>
>
> Le dim. 28 févr. 2016 à 19:01, Todd Nist  a écrit :
>
>> Define your SparkConfig to set the master:
>>
>>   val conf = new SparkConf().setAppName(AppName)
>> .setMaster(SparkMaster)
>> .set()
>>
>> Where SparkMaster = "spark://SparkServerHost:7077".  So if your spark
>> server hostname it "RADTech" then it would be "spark://RADTech:7077".
>>
>> Then when you create the SparkContext, pass the SparkConf  to it:
>>
>> val sparkContext = new SparkContext(conf)
>>
>> Then use the sparkContext for interact with the SparkMaster / Cluster.
>> Your program basically becomes the driver.
>>
>> HTH.
>>
>> -Todd
>>
>> On Sun, Feb 28, 2016 at 9:25 AM, mms  wrote:
>>
>>> Hi, I cannot find a simple example showing how a typical application can
>>> 'connect' to a remote spark cluster and interact with it. Let's say I have
>>> a Python web application hosted somewhere *outside *a spark cluster,
>>> with just python installed on it. How can I talk to Spark without using a
>>> notebook, or using ssh to connect to a cluster master node ? I know of
>>> spark-submit and spark-shell, however forking a process on a remote host to
>>> execute a shell script seems like a lot of effort. What are the recommended
>>> ways to connect and query Spark from a remote client? Thanks!
>>> --
>>> View this message in context: Spark Integration Patterns
>>> 
>>> Sent from the Apache Spark User List mailing list archive
>>>  at Nabble.com.
>>>
>>
>>


Re: output the datas(txt)

2016-02-28 Thread Chandeep Singh
Here is what you can do.

// Recreating your RDD
val a = Array(Array(1, 2, 3), Array(2, 3, 4), Array(3, 4, 5), Array(6, 7, 8))
val b = sc.parallelize(a)

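// Join each inner array's elements with a single space, one output line per record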
val c = b.map(x => (x(0) + " " + x(1) + " " + x(2)))

// Collect to the driver to inspect the result
c.collect()
—> res3: Array[String] = Array(1 2 3, 2 3 4, 3 4 5, 6 7 8)

c.saveAsTextFile("test")
—>
[csingh ~]$ hadoop fs -cat test/*
1 2 3
2 3 4
3 4 5
6 7 8
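
Just as a sketch of an alternative, mkString does the joining for you (same RDD
b as above; the output path below is only a placeholder):

b.map(_.mkString(" ")).saveAsTextFile("test_mkstring")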

> On Feb 28, 2016, at 1:20 AM, Bonsen  wrote:
> 
> I get results from RDDs,
> like :
> Array(Array(1,2,3),Array(2,3,4),Array(3,4,6))
> how can I output them to 1.txt
> like :
> 1 2 3
> 2 3 4
> 3 4 6
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/output-the-datas-txt-tp26350.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 



Re: Spark Integration Patterns

2016-02-28 Thread Todd Nist
Define your SparkConfig to set the master:

  val conf = new SparkConf().setAppName(AppName)
    .setMaster(SparkMaster)
    .set("key", "value")   // placeholder for any additional settings

Where SparkMaster = "spark://SparkServerHost:7077".  So if your Spark
server hostname is "RADTech" then it would be "spark://RADTech:7077".

Then when you create the SparkContext, pass the SparkConf  to it:

val sparkContext = new SparkContext(conf)

Then use the sparkContext to interact with the SparkMaster / Cluster.
Your program basically becomes the driver.
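
For reference, here is a minimal self-contained sketch of such a driver. The
master URL, jar path and memory setting are placeholders, and it assumes the
Spark 1.x libraries are on your application's classpath and that the jar listed
in setJars is reachable by the workers:

import org.apache.spark.{SparkConf, SparkContext}

object RemoteDriverExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("RemoteDriverExample")
      .setMaster("spark://SparkServerHost:7077")         // standalone master URL (placeholder)
      .setJars(Seq("target/remote-driver-example.jar"))  // ship the application jar to the executors
      .set("spark.executor.memory", "1g")                // any further settings go through .set(...)

    val sc = new SparkContext(conf)
    try {
      // Trivial job to verify the connection: sum 1..100 on the cluster.
      val total = sc.parallelize(1 to 100).reduce(_ + _)
      println(s"Sum computed on the cluster: $total")
    } finally {
      sc.stop()
    }
  }
}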

HTH.

-Todd

On Sun, Feb 28, 2016 at 9:25 AM, mms  wrote:

> Hi, I cannot find a simple example showing how a typical application can
> 'connect' to a remote spark cluster and interact with it. Let's say I have
> a Python web application hosted somewhere *outside *a spark cluster, with
> just python installed on it. How can I talk to Spark without using a
> notebook, or using ssh to connect to a cluster master node ? I know of
> spark-submit and spark-shell, however forking a process on a remote host to
> execute a shell script seems like a lot of effort. What are the recommended
> ways to connect and query Spark from a remote client? Thanks!
> --
> View this message in context: Spark Integration Patterns
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: output the datas(txt)

2016-02-28 Thread Don Drake
If you use the spark-csv package:

$ spark-shell  --packages com.databricks:spark-csv_2.11:1.3.0



scala> val df =
sc.parallelize(Array(Array(1,2,3),Array(2,3,4),Array(3,4,6))).map(x =>
(x(0), x(1), x(2))).toDF()
df: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int]

scala> df.write.format("com.databricks.spark.csv").option("delimiter", "
").save("1.txt")

$ cat 1.txt/*
1 2 3
2 3 4
3 4 6


-Don


On Sat, Feb 27, 2016 at 7:20 PM, Bonsen  wrote:

> I get results from RDDs,
> like :
> Array(Array(1,2,3),Array(2,3,4),Array(3,4,6))
> how can I output them to 1.txt
> like :
> 1 2 3
> 2 3 4
> 3 4 6
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/output-the-datas-txt-tp26350.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143


Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-28 Thread Lior Chaga
Hi,
I've experienced a similar problem upgrading from spark 1.4 to spark 1.6.
The data is not evenly distributed across executors, but in my case it also
reproduced with legacy mode.
Also tried 1.6.1 rc-1, with same results.

Still looking for resolution.
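
In case it helps anyone reproducing this, the flag being compared is just a
SparkConf setting; a minimal sketch (it can equally be passed to spark-submit
as --conf spark.memory.useLegacyMode=true, and it defaults to false in 1.6,
i.e. the new unified memory manager):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("legacy-memory-test")
  .set("spark.memory.useLegacyMode", "true")  // revert to the pre-1.6 static memory management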

Lior

On Fri, Feb 19, 2016 at 2:01 AM, Koert Kuipers  wrote:

> looking at the cached rdd i see a similar story:
> with useLegacyMode = true the cached rdd is spread out across 10
> executors, but with useLegacyMode = false the data for the cached rdd sits
> on only 3 executors (the rest all show 0s). my cached RDD is a key-value
> RDD that got partitioned (hash partitioner, 50 partitions) before being
> cached.
>
> On Thu, Feb 18, 2016 at 6:51 PM, Koert Kuipers  wrote:
>
>> hello all,
>> we are just testing a semi-realtime application (it should return results
>> in less than 20 seconds from cached RDDs) on spark 1.6.0. before this it
>> used to run on spark 1.5.1
>>
>> in spark 1.6.0 the performance is similar to 1.5.1 if i set
>> spark.memory.useLegacyMode = true, however if i switch to
>> spark.memory.useLegacyMode = false the queries take about 50% to 100% more
>> time.
>>
>> the issue becomes clear when i focus on a single stage: the individual
>> tasks are not slower at all, but they run on less executors.
>> in my test query i have 50 tasks and 10 executors. both with
>> useLegacyMode = true and useLegacyMode = false the tasks finish in about 3
>> seconds and show as running PROCESS_LOCAL. however when  useLegacyMode =
>> false the tasks run on just 3 executors out of 10, while with useLegacyMode
>> = true they spread out across 10 executors. all the tasks running on just a
>> few executors leads to the slower results.
>>
>> any idea why this would happen?
>> thanks! koert
>>
>>
>>
>


Spark Integration Patterns

2016-02-28 Thread mms
Hi, I cannot find a simple example showing how a typical application can
'connect' to a remote Spark cluster and interact with it. Let's say I have a
Python web application hosted somewhere *outside* a Spark cluster, with just
Python installed on it. How can I talk to Spark without using a notebook, or
using ssh to connect to a cluster master node? I know of spark-submit and
spark-shell, however forking a process on a remote host to execute a shell
script seems like a lot of effort. What are the recommended ways to connect
and query Spark from a remote client? Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Integration-Patterns-tp26354.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Java/Spark Library for interacting with Spark API

2016-02-28 Thread hbogert
Hi, 

Does anyone know of a Java/Scala library (not simply an HTTP library) for
interacting with Spark through its REST/HTTP API? My “problem” is that
interacting through REST induces a lot of work mapping the JSON to sensible
Spark/Scala objects. 

As a simple example, I hope there is a library which allows me to do
something like this (not a prerequisite, only as an example):

   sparkHost("10.0.01").getApplications().first().getJobs().first().status

In broader scope, is using the REST API the only way to retrieve information
from Spark by a different (JVM) process? 
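
For concreteness, the raw interaction I mean looks roughly like the sketch
below (host and port are placeholders; it assumes the standard monitoring
endpoint /api/v1/applications exposed by the application UI or the history
server). Turning that JSON into typed objects is exactly the part I'd like a
library to take care of:

import scala.io.Source

object SparkRestSketch {
  def main(args: Array[String]): Unit = {
    val host = "10.0.0.1"  // placeholder
    val url = s"http://$host:4040/api/v1/applications"
    // Fetch the raw JSON; mapping it onto case classes is the tedious part.
    val json = Source.fromURL(url).mkString
    println(json)
  }
}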

Regards,

Hans



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Java-Spark-Library-for-interacting-with-Spark-API-tp26353.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org