Re: Graphframe Error

2016-07-05 Thread Felix Cheung
This could be the workaround:

http://stackoverflow.com/a/36419857




On Tue, Jul 5, 2016 at 5:37 AM -0700, "Arun Patel" 
> wrote:

Thanks Yanbo and Felix.

I tried these commands on CDH Quickstart VM and also on "Spark 1.6 pre-built 
for Hadoop" version.  I am still not able to get it working.  Not sure what I 
am missing.  Attaching the logs.




On Mon, Jul 4, 2016 at 5:33 AM, Felix Cheung 
> wrote:
It looks like either the extracted Python code is corrupted or there is a 
mismatched Python version. Are you using Python 3?


stackoverflow.com/questions/514371/whats-the-bad-magic-number-error





On Mon, Jul 4, 2016 at 1:37 AM -0700, "Yanbo Liang" 
> wrote:

Hi Arun,

The command

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

will automatically download the required graphframes jar from the Maven 
repository; it is not affected by where the jar file was placed. 
Your example works fine on my laptop.

Alternatively, you can try


bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar

to launch PySpark with GraphFrames enabled. Set the "--py-files" and 
"--jars" options to the path where you saved graphframes.jar.

Thanks
Yanbo


2016-07-03 15:48 GMT-07:00 Arun Patel 
>:
I started my pyspark shell with the following command (I am using Spark 1.6).

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

I have copied 
http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar
 to the lib directory of Spark as well.

I was getting the error below:

>>> from graphframes import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zipimport.ZipImportError: can't find module 'graphframes'
>>>

So, as per suggestions from similar questions, I extracted the graphframes 
Python directory and copied it to the local directory where I am running pyspark.

>>> from graphframes import *

But I am not able to create a GraphFrame:

>>> g = GraphFrame(v, e)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'GraphFrame' is not defined

I am also getting the error below:
>>> from graphframes.examples import Graphs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: Bad magic number in graphframes/examples.pyc

Any help will be highly appreciated.

- Arun




Re: spark local dir to HDFS ?

2016-07-05 Thread sri hari kali charan Tummala
Thanks, that makes sense. Can anyone answer the question below?

http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-td27264.html

Thanks
Sri

On Tue, Jul 5, 2016 at 8:15 PM, Saisai Shao  wrote:

> It is not worked to configure local dirs to HDFS. Local dirs are mainly
> used for data spill and shuffle data persistence, it is not suitable to use
> hdfs. If you met capacity problem, you could configure multiple dirs
> located in different mounted disks.
>
> On Wed, Jul 6, 2016 at 9:05 AM, Sri  wrote:
>
>> Hi ,
>>
>> Space issue  we are currently using /tmp and at the moment we don't have
>> any mounted location setup yet.
>>
>> Thanks
>> Sri
>>
>>
>> Sent from my iPhone
>>
>> On 5 Jul 2016, at 17:22, Jeff Zhang  wrote:
>>
>> Any reason why you want to set this on hdfs ?
>>
>> On Tue, Jul 5, 2016 at 3:47 PM, kali.tumm...@gmail.com <
>> kali.tumm...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> can I set spark.local.dir to HDFS location instead of /tmp folder ?
>>>
>>> I tried setting up temp folder to HDFS but it didn't worked can
>>> spark.local.dir write to HDFS ?
>>>
>>> .set("spark.local.dir","hdfs://namednode/spark_tmp/")
>>>
>>>
>>> 16/07/05 15:35:47 ERROR DiskBlockManager: Failed to create local dir in
>>> hdfs://namenode/spark_tmp/. Ignoring this directory.
>>> java.io.IOException: Failed to create a temp directory (under
>>> hdfs://namenode/spark_tmp/) after 10 attempts!
>>>
>>>
>>> Thanks
>>> Sri
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-local-dir-to-HDFS-tp27291.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>>> .
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>>
>


-- 
Thanks & Regards
Sri Tummala


Re: spark local dir to HDFS ?

2016-07-05 Thread Saisai Shao
It does not work to configure local dirs on HDFS. Local dirs are mainly
used for data spill and shuffle data persistence, so HDFS is not suitable for
them. If you have a capacity problem, you could configure multiple dirs
located on different mounted disks.
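
For example, a minimal sketch of pointing spark.local.dir at two local mount
points instead (the paths and app name below are only illustrative; the same
value can also go into conf/spark-defaults.conf):

    // Scala sketch; /mnt/disk1 and /mnt/disk2 are assumed to be locally mounted disks
    val conf = new org.apache.spark.SparkConf()
      .setAppName("LocalDirsExample")
      .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")
    val sc = new org.apache.spark.SparkContext(conf)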

On Wed, Jul 6, 2016 at 9:05 AM, Sri  wrote:

> Hi ,
>
> Space issue  we are currently using /tmp and at the moment we don't have
> any mounted location setup yet.
>
> Thanks
> Sri
>
>
> Sent from my iPhone
>
> On 5 Jul 2016, at 17:22, Jeff Zhang  wrote:
>
> Any reason why you want to set this on hdfs ?
>
> On Tue, Jul 5, 2016 at 3:47 PM, kali.tumm...@gmail.com <
> kali.tumm...@gmail.com> wrote:
>
>> Hi All,
>>
>> can I set spark.local.dir to HDFS location instead of /tmp folder ?
>>
>> I tried setting up temp folder to HDFS but it didn't worked can
>> spark.local.dir write to HDFS ?
>>
>> .set("spark.local.dir","hdfs://namednode/spark_tmp/")
>>
>>
>> 16/07/05 15:35:47 ERROR DiskBlockManager: Failed to create local dir in
>> hdfs://namenode/spark_tmp/. Ignoring this directory.
>> java.io.IOException: Failed to create a temp directory (under
>> hdfs://namenode/spark_tmp/) after 10 attempts!
>>
>>
>> Thanks
>> Sri
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-local-dir-to-HDFS-tp27291.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>> .
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>


Re: spark local dir to HDFS ?

2016-07-05 Thread Sri
Hi ,

Space issue: we are currently using /tmp, and at the moment we don't have any 
other mounted location set up yet.

Thanks
Sri


Sent from my iPhone

> On 5 Jul 2016, at 17:22, Jeff Zhang  wrote:
> 
> Any reason why you want to set this on hdfs ?
> 
>> On Tue, Jul 5, 2016 at 3:47 PM, kali.tumm...@gmail.com 
>>  wrote:
>> Hi All,
>> 
>> can I set spark.local.dir to HDFS location instead of /tmp folder ?
>> 
>> I tried setting up temp folder to HDFS but it didn't worked can
>> spark.local.dir write to HDFS ?
>> 
>> .set("spark.local.dir","hdfs://namednode/spark_tmp/")
>> 
>> 
>> 16/07/05 15:35:47 ERROR DiskBlockManager: Failed to create local dir in
>> hdfs://namenode/spark_tmp/. Ignoring this directory.
>> java.io.IOException: Failed to create a temp directory (under
>> hdfs://namenode/spark_tmp/) after 10 attempts!
>> 
>> 
>> Thanks
>> Sri
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-local-dir-to-HDFS-tp27291.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang


Re: spark local dir to HDFS ?

2016-07-05 Thread Jeff Zhang
Any reason why you want to set this on hdfs ?

On Tue, Jul 5, 2016 at 3:47 PM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:

> Hi All,
>
> can I set spark.local.dir to HDFS location instead of /tmp folder ?
>
> I tried setting up temp folder to HDFS but it didn't worked can
> spark.local.dir write to HDFS ?
>
> .set("spark.local.dir","hdfs://namednode/spark_tmp/")
>
>
> 16/07/05 15:35:47 ERROR DiskBlockManager: Failed to create local dir in
> hdfs://namenode/spark_tmp/. Ignoring this directory.
> java.io.IOException: Failed to create a temp directory (under
> hdfs://namenode/spark_tmp/) after 10 attempts!
>
>
> Thanks
> Sri
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-local-dir-to-HDFS-tp27291.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


RE: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-05 Thread Shiryaev, Mikhail
Hi Alexander,

I used the same example from the MLP user guide, but in Java.
I modified the example a little bit, and there was a nasty bug of mine that I hadn't 
noticed (an inconsistency between the layers and the real feature count).
After fixing that, MLP works in my test.
So this was my oversight, sorry.
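
For reference, a minimal hedged sketch of a consistent setup (the 128-feature /
1000-class sizes simply follow the aloi example quoted below; block size and
iteration count are the defaults Alexander mentions):

    // Scala sketch; the first layer size must equal the feature count, the last the class count
    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
    val trainer = new MultilayerPerceptronClassifier()
      .setLayers(Array[Int](128, 128, 1000))
      .setBlockSize(128)
      .setMaxIter(100)
    // val model = trainer.fit(trainingData)  // trainingData: DataFrame with "label"/"features" columns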

Best regards,
Mikhail Shiryaev

From: Ulanov, Alexander [mailto:alexander.ula...@hpe.com]
Sent: Tuesday, July 5, 2016 9:32 PM
To: Yanbo Liang ; Shiryaev, Mikhail 

Cc: user@spark.apache.org
Subject: RE: Spark MLlib: MultilayerPerceptronClassifier error?

Hi Mikhail,

I have followed the MLP user-guide and used the dataset and network 
configuration you mentioned. MLP was trained without any issues with default 
parameters, that is block size of 128 and 100 iterations.

Source code:
scala> import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val data = sqlContext.read.format("libsvm").load("/data/aloi.scale")
scala> val trainer = new MultilayerPerceptronClassifier().setLayers(Array(128, 128, 1000))
scala> val model = trainer.fit(data)
(after a while)
model: org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel = mlpc_fb3bd70d2ef2

It seems that submitting an Issue is premature. Could you share your code 
instead?

Best regards, Alexander

Just in case, here is the link to the user guide:
https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier


From: Yanbo Liang [mailto:yblia...@gmail.com]
Sent: Monday, July 04, 2016 9:58 PM
To: mshiryae >
Cc: user@spark.apache.org
Subject: Re: Spark MLlib: MultilayerPerceptronClassifier error?

Would you mind filing a JIRA to track this issue? I will take a look when I 
have time.

2016-07-04 14:09 GMT-07:00 mshiryae 
>:
Hi,

I am trying to train a model with MultilayerPerceptronClassifier.

It works on sample data from
data/mllib/sample_multiclass_classification_data.txt with 4 features, 3
classes and layers [4, 4, 3].
But when I try to use other input files with other features and classes
(from here for example:
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html)
then I get errors.

Example:
Input file aloi (128 features, 1000 classes, layers [128, 128, 1000]):


with block size = 1:
ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation.
Decreasing step size to Infinity
ERROR LBFGS: Failure! Resetting history:
breeze.optimize.FirstOrderException: Line search failed
ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is
just poorly behaved?


with default block size = 128:
 java.lang.ArrayIndexOutOfBoundsException
  at java.lang.System.arraycopy(Native Method)
  at
org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:629)
  at
org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:628)
   at scala.collection.immutable.List.foreach(List.scala:381)
   at
org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:628)
   at
org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:624)



Even if I modify sample_multiclass_classification_data.txt file (rename all
4-th features to 5-th) and run with layers [5, 5, 3] then I also get the
same errors as for file above.


So, to summarize:
I can't run training with the default block size and more than 4 features.
If I set the block size to 1 then some progress is made, but I get errors
from LBFGS.
It is reproducible with Spark 1.5.2 and with the master branch from GitHub (as of
4th July).

Has anybody already encountered such behavior?
Is there a bug in MultilayerPerceptronClassifier, or am I using it incorrectly?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLlib-MultilayerPerceptronClassifier-error-tp27279.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org



Joint Stock Company Intel A/O
Registered legal address: Krylatsky Hills Business Park,
17 Krylatskaya Str., Bldg 4, Moscow 121614,
Russian Federation

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


spark local dir to HDFS ?

2016-07-05 Thread kali.tumm...@gmail.com
Hi All, 

Can I set spark.local.dir to an HDFS location instead of the /tmp folder?

I tried setting the temp folder to HDFS but it didn't work. Can
spark.local.dir write to HDFS?

.set("spark.local.dir","hdfs://namednode/spark_tmp/")


16/07/05 15:35:47 ERROR DiskBlockManager: Failed to create local dir in
hdfs://namenode/spark_tmp/. Ignoring this directory.
java.io.IOException: Failed to create a temp directory (under
hdfs://namenode/spark_tmp/) after 10 attempts!


Thanks
Sri 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-local-dir-to-HDFS-tp27291.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Bootstrap Action to Install Spark 2.0 on EMR?

2016-07-05 Thread Holden Karau
Just to be clear, Spark 2.0 isn't released yet; there is a preview version
for developers to explore and test compatibility with. That being said, Roy
Hasson has a blog post discussing using Spark 2.0-preview with EMR:
https://medium.com/@royhasson/running-spark-2-0-preview-on-emr-635081e01341#.r1tzra3ff
:)

On Saturday, July 2, 2016, Renxia Wang  wrote:

> Hi all,
>
> Anybody had tried out Spark 2.0 on EMR 4.x? Will it work? I am looking for
> a bootstrap action script to install it on EMR, does some one have a
> working one to share? Appreciate that!
>
> Best,
>
> Renxia
>


Working of Streaming Kmeans

2016-07-05 Thread Holden Karau
Hi Biplob,

The current Streaming KMeans code only updates the model with data that comes in
through training (i.e. trainOn); predictOn does not update the model.
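
As a rough illustration (the directories and parameters below are hypothetical,
and ssc is assumed to be an existing StreamingContext), only trainOn moves the
cluster centers; predictOnValues merely assigns points to them:

    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val trainingData = ssc.textFileStream("/data/train").map(Vectors.parse)
    val testData = ssc.textFileStream("/data/test").map(LabeledPoint.parse)

    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)
      .setRandomCenters(2, 0.0)  // dimension 2, initial weight 0.0

    model.trainOn(trainingData)  // this is what updates the model
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()  // no update here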

Cheers,

Holden :)

P.S.

Traffic on the list might have been a bit slower right now because of the
Canada Day and 4th of July weekends, respectively.

On Sunday, July 3, 2016, Biplob Biswas  wrote:

> Hi,
>
> Can anyone please explain this?
>
> Thanks & Regards
> Biplob Biswas
>
> On Sat, Jul 2, 2016 at 4:48 PM, Biplob Biswas 
> wrote:
>
>> Hi,
>>
>> I wanted to ask a very basic question about the working of Streaming
>> Kmeans.
>>
>> Does the model update only when training (i.e. training dataset is used)
>> or
>> does it update on the PredictOnValues function as well for the test
>> dataset?
>>
>> Thanks and Regards
>> Biplob
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Working-of-Streaming-Kmeans-tp27268.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


SnappyData and Structured Streaming

2016-07-05 Thread Benjamin Kim
I recently got a sales email from SnappyData, and after reading the 
documentation about what they offer, it sounds very similar to what Structured 
Streaming will offer w/o the underlying in-memory, spill-to-disk, CRUD 
compliant data storage in SnappyData. I was wondering if Structured Streaming 
is trying to achieve the same on its own or is SnappyData contributing 
Streaming extensions that they built to the Spark project. Lastly, what does 
the Spark community think of this so-called “Spark Data Store”?

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Mich Talebzadeh
Well, that is what the OP stated:

"I have a spark cluster consisting of 4 nodes in a standalone mode ..."

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 5 July 2016 at 19:24, Michael Segel  wrote:

> Did the OP say he was running a stand alone cluster of Spark, or on Yarn?
>
>
> On Jul 5, 2016, at 10:22 AM, Mich Talebzadeh 
> wrote:
>
> Hi Jakub,
>
> Any reason why you are running in standalone mode, given that your are
> familiar with YARN?
>
> In theory your settings are correct. I checked your environment tab
> settings and they look correct.
>
> I assume you have checked this link
>
> http://spark.apache.org/docs/latest/spark-standalone.html
>
> BTW is this issue confined to ML or any other Spark application exhibits
> the same behaviour in standalone mode?
>
>
> HTH
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 5 July 2016 at 11:17, Jacek Laskowski  wrote:
>
>> Hi Jakub,
>>
>> You're correct - spark.masterspark://master.clust:7077 - proves your
>> point. You're running Spark Standalone that was set in
>> conf/spark-defaults.conf perhaps.
>>
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> On Tue, Jul 5, 2016 at 12:04 PM, Jakub Stransky 
>> wrote:
>>
>>> Hello,
>>>
>>> I am convinced that we are not running in local mode:
>>>
>>> Runtime Information
>>>
>>> NameValue
>>> Java Home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
>>> Java Version1.7.0_65 (Oracle Corporation)
>>> Scala Versionversion 2.10.5
>>> Spark Properties
>>>
>>> NameValue
>>> spark.app.idapp-20160704121044-0003
>>> spark.app.nameDemoApp
>>> spark.driver.extraClassPath/home/sparkuser/sqljdbc4.jar
>>> spark.driver.host10.2.0.4
>>> spark.driver.memory4g
>>> spark.driver.port59493
>>> spark.executor.extraClassPath/usr/local/spark-1.6.1/sqljdbc4.jar
>>> spark.executor.iddriver
>>> spark.executor.memory12g
>>> spark.externalBlockStore.folderName
>>>  spark-5630dd34-4267-462e-882e-b382832bb500
>>> spark.jarsfile:/home/sparkuser/SparkPOC.jar
>>> spark.masterspark://master.clust:7077
>>> spark.scheduler.modeFIFO
>>> spark.submit.deployModeclient
>>> System Properties
>>>
>>> NameValue
>>> SPARK_SUBMITtrue
>>> awt.toolkitsun.awt.X11.XToolkit
>>> file.encodingUTF-8
>>> file.encoding.pkgsun.io
>>> file.separator/
>>> java.awt.graphicsenvsun.awt.X11GraphicsEnvironment
>>> java.awt.printerjobsun.print.PSPrinterJob
>>> java.class.version51.0
>>> java.endorsed.dirs
>>>  /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/endorsed
>>> java.ext.dirs
>>>  
>>> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/ext:/usr/java/packages/lib/ext
>>> java.home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
>>> java.io.tmpdir/tmp
>>> java.library.path
>>>  /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>>> java.runtime.nameOpenJDK Runtime Environment
>>> java.runtime.version1.7.0_65-mockbuild_2014_07_16_06_06-b00
>>> java.specification.nameJava Platform API Specification
>>> java.specification.vendorOracle Corporation
>>> java.specification.version1.7
>>> java.vendorOracle Corporation
>>> java.vendor.urlhttp://java.oracle.com/
>>> java.vendor.url.bughttp://bugreport.sun.com/bugreport/
>>> java.version1.7.0_65
>>> java.vm.infomixed mode
>>> java.vm.nameOpenJDK 64-Bit Server VM
>>> java.vm.specification.nameJava Virtual Machine Specification
>>> java.vm.specification.vendorOracle Corporation
>>> java.vm.specification.version1.7
>>> java.vm.vendorOracle Corporation
>>> java.vm.version24.65-b04
>>> line.separator
>>> 

RE: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-05 Thread Ulanov, Alexander
Hi Mikhail,

I have followed the MLP user-guide and used the dataset and network 
configuration you mentioned. MLP was trained without any issues with default 
parameters, that is block size of 128 and 100 iterations.

Source code:
scala> import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val data = sqlContext.read.format("libsvm").load("/data/aloi.scale")
scala> val trainer = new MultilayerPerceptronClassifier().setLayers(Array(128, 128, 1000))
scala> val model = trainer.fit(data)
(after a while)
model: org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel = mlpc_fb3bd70d2ef2

It seems that submitting an Issue is premature. Could you share your code 
instead?

Best regards, Alexander

Just in case, here is the link to the user guide:
https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier


From: Yanbo Liang [mailto:yblia...@gmail.com]
Sent: Monday, July 04, 2016 9:58 PM
To: mshiryae 
Cc: user@spark.apache.org
Subject: Re: Spark MLlib: MultilayerPerceptronClassifier error?

Would you mind filing a JIRA to track this issue? I will take a look when I 
have time.

2016-07-04 14:09 GMT-07:00 mshiryae 
>:
Hi,

I am trying to train a model with MultilayerPerceptronClassifier.

It works on sample data from
data/mllib/sample_multiclass_classification_data.txt with 4 features, 3
classes and layers [4, 4, 3].
But when I try to use other input files with other features and classes
(from here for example:
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html)
then I get errors.

Example:
Input file aloi (128 features, 1000 classes, layers [128, 128, 1000]):


with block size = 1:
ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation.
Decreasing step size to Infinity
ERROR LBFGS: Failure! Resetting history:
breeze.optimize.FirstOrderException: Line search failed
ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is
just poorly behaved?


with default block size = 128:
 java.lang.ArrayIndexOutOfBoundsException
  at java.lang.System.arraycopy(Native Method)
  at
org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:629)
  at
org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:628)
   at scala.collection.immutable.List.foreach(List.scala:381)
   at
org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:628)
   at
org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:624)



Even if I modify sample_multiclass_classification_data.txt file (rename all
4-th features to 5-th) and run with layers [5, 5, 3] then I also get the
same errors as for file above.


So, to summarize:
I can't run training with the default block size and more than 4 features.
If I set the block size to 1 then some progress is made, but I get errors
from LBFGS.
It is reproducible with Spark 1.5.2 and with the master branch from GitHub (as of
4th July).

Has anybody already encountered such behavior?
Is there a bug in MultilayerPerceptronClassifier, or am I using it incorrectly?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLlib-MultilayerPerceptronClassifier-error-tp27279.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org



Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Michael Segel
Did the OP say he was running a stand alone cluster of Spark, or on Yarn? 


> On Jul 5, 2016, at 10:22 AM, Mich Talebzadeh  
> wrote:
> 
> Hi Jakub,
> 
> Any reason why you are running in standalone mode, given that your are 
> familiar with YARN?
> 
> In theory your settings are correct. I checked your environment tab settings 
> and they look correct.
> 
> I assume you have checked this link
> 
> http://spark.apache.org/docs/latest/spark-standalone.html 
> 
> 
> BTW is this issue confined to ML or any other Spark application exhibits the 
> same behaviour in standalone mode?
> 
> 
> HTH
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 5 July 2016 at 11:17, Jacek Laskowski  > wrote:
> Hi Jakub,
> 
> You're correct - spark.masterspark://master.clust:7077 - proves your 
> point. You're running Spark Standalone that was set in 
> conf/spark-defaults.conf perhaps.
> 
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/ 
> Mastering Apache Spark http://bit.ly/mastering-apache-spark 
> 
> Follow me at https://twitter.com/jaceklaskowski 
> 
> 
> On Tue, Jul 5, 2016 at 12:04 PM, Jakub Stransky  > wrote:
> Hello,
> 
> I am convinced that we are not running in local mode:
> 
> Runtime Information
> 
> NameValue
> Java Home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
> Java Version1.7.0_65 (Oracle Corporation)
> Scala Versionversion 2.10.5
> Spark Properties
> 
> NameValue
> spark.app.id app-20160704121044-0003
> spark.app.name DemoApp
> spark.driver.extraClassPath/home/sparkuser/sqljdbc4.jar
> spark.driver.host10.2.0.4
> spark.driver.memory4g
> spark.driver.port59493
> spark.executor.extraClassPath/usr/local/spark-1.6.1/sqljdbc4.jar
> spark.executor.id driver
> spark.executor.memory12g
> spark.externalBlockStore.folderName
> spark-5630dd34-4267-462e-882e-b382832bb500
> spark.jarsfile:/home/sparkuser/SparkPOC.jar
> spark.masterspark://master.clust:7077
> spark.scheduler.modeFIFO
> spark.submit.deployModeclient
> System Properties
> 
> NameValue
> SPARK_SUBMITtrue
> awt.toolkitsun.awt.X11.XToolkit
> file.encodingUTF-8
> file.encoding.pkgsun.io 
> file.separator/
> java.awt.graphicsenvsun.awt.X11GraphicsEnvironment
> java.awt.printerjobsun.print.PSPrinterJob
> java.class.version51.0
> java.endorsed.dirs
> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/endorsed
> java.ext.dirs
> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/ext:/usr/java/packages/lib/ext
> java.home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
> java.io.tmpdir/tmp
> java.library.path
> /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> java.runtime.name OpenJDK Runtime Environment
> java.runtime.version1.7.0_65-mockbuild_2014_07_16_06_06-b00
> java.specification.name Java Platform 
> API Specification
> java.specification.vendorOracle Corporation
> java.specification.version1.7
> java.vendorOracle Corporation
> java.vendor.urlhttp://java.oracle.com/ 
> java.vendor.url.bughttp://bugreport.sun.com/bugreport/ 
> 
> java.version1.7.0_65
> java.vm.info mixed mode
> java.vm.name OpenJDK 64-Bit Server VM
> java.vm.specification.name Java 
> Virtual Machine Specification
> java.vm.specification.vendorOracle Corporation
> java.vm.specification.version1.7
> java.vm.vendorOracle Corporation
> java.vm.version24.65-b04
> line.separator
> os.archamd64
> os.name Linux
> os.version2.6.32-431.29.2.el6.x86_64
> path.separator:
> sun.arch.data.model64
> sun.boot.class.path
> 

Re: Spark Dataframe validating column names

2016-07-05 Thread Scott W
Hi,

Yes, I tried that; however, I also want to "pin down" the specific events
containing invalid characters in their column names (per the Parquet spec)
and drop them from the df before converting it to Parquet.

Where I'm having trouble is that my dataframe might have events with different
sets of fields, so directly applying df.columns or df.schema would return
the superset schema across all the events.

I tried this approach but unfortunately RDD[Row] cannot be converted back
to Dataframe without a schema. I'm relying on automatic schema inference of
Spark SQL.

df.map { r =>
  val fieldNames = r.schema.fieldNames

..

Thanks!
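
For what it's worth, a rough sketch (Spark 1.6 DataFrame API; the output path is
hypothetical) of one possible approach: rather than dropping individual events,
detect the offending column names with the same character class used by
CatalystSchemaConverter and drop those columns before writing:

    val invalidCols = df.columns.filter(_.matches(".*[ ,;{}()\n\t=].*"))
    invalidCols.foreach(c => println(s"Dropping column with invalid name: $c"))
    val cleaned = invalidCols.foldLeft(df)((acc, c) => acc.drop(c))
    cleaned.write.parquet("/tmp/cleaned-events")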

On Tue, Jul 5, 2016 at 5:56 AM, Jacek Laskowski  wrote:

> Hi,
>
> What do you think of using df.columns to know the column names and
> process appropriately or df.schema?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Jul 5, 2016 at 7:02 AM, Scott W  wrote:
> > Hello,
> >
> > I'm processing events using Dataframes converted from a stream of JSON
> > events (Spark streaming) which eventually gets written out as as Parquet
> > format. There are different JSON events coming in so we use schema
> inference
> > feature of Spark SQL
> >
> > The problem is some of the JSON events contains spaces in the keys which
> I
> > want to log and filter/drop such events from the data frame before
> > converting it to Parquet because ,;{}()\n\t= are considered special
> > characters in Parquet schema (CatalystSchemaConverter) as listed in [1]
> > below and thus should not be allowed in the column names.
> >
> > How can I do such validations in Dataframe on the column names and drop
> such
> > an event altogether without erroring out the Spark Streaming job?
> >
> > [1] Spark's CatalystSchemaConverter
> >
> > def checkFieldName(name: String): Unit = {
> > // ,;{}()\n\t= and space are special characters in Parquet schema
> > checkConversionRequirement(
> >   !name.matches(".*[ ,;{}()\n\t=].*"),
> >   s"""Attribute name "$name" contains invalid character(s) among "
> > ,;{}()\\n\\t=".
> >  |Please use alias to rename it.
> >""".stripMargin.split("\n").mkString(" ").trim)
> >   }
>


Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
Test by producing messages into kafka at a rate comparable to what you
expect in production.

Test with backpressure turned on; it doesn't require you to specify a
fixed limit on the number of messages and will do its best to maintain
batch timing.  Or you could empirically determine a reasonable fixed
limit.

Setting up a kafka topic with way more static messages in it than your
system can handle in one batch, and then starting a stream from the
beginning of it without turning on backpressure or limiting the number
of messages... isn't a reasonable way to test steady state
performance.  Flink can't magically give you a correct answer under
those circumstances either.
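
As a rough sketch, the relevant settings (the cap value below is purely
illustrative) would look something like this:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")       // adapt ingestion rate to batch timing
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // hard cap: messages per partition per second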

On Tue, Jul 5, 2016 at 10:41 AM, rss rss  wrote:
> Hi, thanks.
>
>I know about possibility to limit number of messages. But the problem is
> I don't know number of messages which the system able to process. It depends
> on data. The example is very simple. I need a strict response after
> specified time. Something like soft real time. In case of Flink I able to
> setup strict time of processing like this:
>
> KeyedStream keyed =
> eventStream.keyBy(event.userId.getBytes()[0] % partNum);
> WindowedStream uniqUsersWin = keyed.timeWindow(
> Time.seconds(10) );
> DataStream uniqUsers =
> uniq.trigger(ProcessingTimeTrigger.create())
> .fold(new Aggregator(), new FoldFunction() {
> @Override
> public Aggregator fold(Aggregator accumulator, Event value)
> throws Exception {
> accumulator.add(event.userId);
> return accumulator;
> }
> });
>
> uniq.print();
>
> And I can see results every 10 seconds independently on input data stream.
> Is it possible something in Spark?
>
> Regarding zeros in my example the reason I have prepared message queue in
> Kafka for the tests. If I add some messages after I able to see new
> messages. But in any case I need first response after 10 second. Not minutes
> or hours after.
>
> Thanks.
>
>
>
> 2016-07-05 17:12 GMT+02:00 Cody Koeninger :
>>
>> If you're talking about limiting the number of messages per batch to
>> try and keep from exceeding batch time, see
>>
>> http://spark.apache.org/docs/latest/configuration.html
>>
>> look for backpressure and maxRatePerParition
>>
>>
>> But if you're only seeing zeros after your job runs for a minute, it
>> sounds like something else is wrong.
>>
>>
>> On Tue, Jul 5, 2016 at 10:02 AM, rss rss  wrote:
>> > Hello,
>> >
>> >   I'm trying to organize processing of messages from Kafka. And there is
>> > a
>> > typical case when a number of messages in kafka's queue is more then
>> > Spark
>> > app's possibilities to process. But I need a strong time limit to
>> > prepare
>> > result for at least for a part of data.
>> >
>> > Code example:
>> >
>> > SparkConf sparkConf = new SparkConf()
>> > .setAppName("Spark")
>> > .setMaster("local");
>> >
>> > JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
>> > Milliseconds.apply(5000));
>> >
>> > jssc.checkpoint("/tmp/spark_checkpoint");
>> >
>> > Set topicMap = new
>> > HashSet<>(Arrays.asList(topicList.split(",")));
>> > Map kafkaParams = new HashMap()
>> > {
>> > {
>> > put("metadata.broker.list", bootstrapServers);
>> > put("auto.offset.reset", "smallest");
>> > }
>> > };
>> >
>> > JavaPairInputDStream messages =
>> > KafkaUtils.createDirectStream(jssc,
>> > String.class,
>> > String.class,
>> > StringDecoder.class,
>> > StringDecoder.class,
>> > kafkaParams,
>> > topicMap);
>> >
>> > messages.countByWindow(Seconds.apply(10),
>> > Milliseconds.apply(5000))
>> > .map(x -> {System.out.println(x); return x;})
>> > .dstream().saveAsTextFiles("/tmp/spark",
>> > "spark-streaming");
>> >
>> >
>> >   I need to see a result of window operation each 10 seconds (this is
>> > only
>> > simplest example). But really with my test data ~10M messages I have
>> > first
>> > result a minute after and further I see only zeros. Is a way to limit
>> > processing time to guarantee a response in specified time like Apache
>> > Flink's triggers?
>> >
>> > Thanks.
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread rss rss
Hi, thanks.

   I know about the possibility of limiting the number of messages. But the problem is
that I don't know the number of messages the system is able to process. It
depends on the data. The example is very simple. I need a strict response after
a specified time, something like soft real time. In the case of Flink I am able to
set up a strict processing time like this:

KeyedStream keyed =
eventStream.keyBy(event.userId.getBytes()[0] % partNum);
WindowedStream uniqUsersWin =
keyed.timeWindow( Time.seconds(10) );
DataStream uniqUsers =
uniq.trigger(ProcessingTimeTrigger.create())
.fold(new Aggregator(), new FoldFunction() {
@Override
public Aggregator fold(Aggregator accumulator, Event
value) throws Exception {
accumulator.add(event.userId);
return accumulator;
}
});

uniq.print();

And I can see results every 10 seconds independently on input data stream.
Is it possible something in Spark?

Regarding the zeros in my example: the reason is that I prepared the message queue in
Kafka before the tests. If I add some messages afterwards, I am able to see the new
messages. But in any case I need the first response after 10 seconds, not
minutes or hours later.

Thanks.



2016-07-05 17:12 GMT+02:00 Cody Koeninger :

> If you're talking about limiting the number of messages per batch to
> try and keep from exceeding batch time, see
>
> http://spark.apache.org/docs/latest/configuration.html
>
> look for backpressure and maxRatePerParition
>
>
> But if you're only seeing zeros after your job runs for a minute, it
> sounds like something else is wrong.
>
>
> On Tue, Jul 5, 2016 at 10:02 AM, rss rss  wrote:
> > Hello,
> >
> >   I'm trying to organize processing of messages from Kafka. And there is
> a
> > typical case when a number of messages in kafka's queue is more then
> Spark
> > app's possibilities to process. But I need a strong time limit to prepare
> > result for at least for a part of data.
> >
> > Code example:
> >
> > SparkConf sparkConf = new SparkConf()
> > .setAppName("Spark")
> > .setMaster("local");
> >
> > JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
> > Milliseconds.apply(5000));
> >
> > jssc.checkpoint("/tmp/spark_checkpoint");
> >
> > Set topicMap = new
> > HashSet<>(Arrays.asList(topicList.split(",")));
> > Map kafkaParams = new HashMap() {
> > {
> > put("metadata.broker.list", bootstrapServers);
> > put("auto.offset.reset", "smallest");
> > }
> > };
> >
> > JavaPairInputDStream messages =
> > KafkaUtils.createDirectStream(jssc,
> > String.class,
> > String.class,
> > StringDecoder.class,
> > StringDecoder.class,
> > kafkaParams,
> > topicMap);
> >
> > messages.countByWindow(Seconds.apply(10),
> Milliseconds.apply(5000))
> > .map(x -> {System.out.println(x); return x;})
> > .dstream().saveAsTextFiles("/tmp/spark",
> "spark-streaming");
> >
> >
> >   I need to see a result of window operation each 10 seconds (this is
> only
> > simplest example). But really with my test data ~10M messages I have
> first
> > result a minute after and further I see only zeros. Is a way to limit
> > processing time to guarantee a response in specified time like Apache
> > Flink's triggers?
> >
> > Thanks.
>


Re: remove row from data frame

2016-07-05 Thread nihed mbarek
hi,

You can apply one or more filter operations to keep only the data that you need.
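
A minimal sketch (the column names here are hypothetical):

    // keep only rows that satisfy a condition
    import org.apache.spark.sql.functions.col
    val kept = df.filter(col("age") > 18 && col("name").isNotNull)
    // or, equivalently, drop rows by negating the condition that identifies them
    val cleaned = df.filter(!(col("status") === "invalid"))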

regards,

On Tue, Jul 5, 2016 at 5:38 PM, pseudo oduesp  wrote:

> Hi ,
> how i can remove row from data frame  verifying some condition on some
> columns ?
> thanks
>



-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




remove row from data frame

2016-07-05 Thread pseudo oduesp
Hi,
How can I remove rows from a data frame based on a condition on some
columns?
thanks


Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Mathieu Longtin
From experience, here are the kinds of things that cause the driver to run out
of memory:
- Way too many partitions (1 and up)
- Something like this:
data = load_large_data()
rdd = sc.parallelize(data)

- Any call to rdd.collect() or rdd.take(N) where the resulting data is
bigger than driver memory.
- rdd.limit(N) seems to crash on large N.

Btw, in this context, Java's memory requirements are on the order of 10X
what the raw data requires. So if you have a CSV with 1 million lines of
1KB each, expect the JVM to require 10GB to load it, not 1GB. This is not
an exact number, just an impression from observing what crashes the driver
when doing rdd.collect().
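
As a hedged sketch of avoiding the second pattern above (the paths are
hypothetical): let the executors read the data themselves instead of loading it
on the driver and parallelizing it:

    // driver-heavy pattern that risks driver OOM:
    //   val data = scala.io.Source.fromFile("/data/big.csv").getLines().toSeq
    //   val rdd  = sc.parallelize(data)
    // executor-side pattern:
    val rdd = sc.textFile("hdfs:///data/big.csv")
    println(rdd.count())  // the action runs on executors; only the count comes back to the driver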

On Tue, Jul 5, 2016 at 11:19 AM Jakub Stransky 
wrote:

> So now that we clarified that all is submitted at cluster standalone mode
> what is left when the application (ML pipeline) doesn't take advantage of
> full cluster power but essentially running just on master node until
> resources are exhausted. Why training ml Decesion Tree doesn't scale to the
> rest of the cluster?
>
> On 5 July 2016 at 12:17, Jacek Laskowski  wrote:
>
>> Hi Jakub,
>>
>> You're correct - spark.masterspark://master.clust:7077 - proves your
>> point. You're running Spark Standalone that was set in
>> conf/spark-defaults.conf perhaps.
>>
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> On Tue, Jul 5, 2016 at 12:04 PM, Jakub Stransky 
>> wrote:
>>
>>> Hello,
>>>
>>> I am convinced that we are not running in local mode:
>>>
>>> Runtime Information
>>>
>>> NameValue
>>> Java Home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
>>> Java Version1.7.0_65 (Oracle Corporation)
>>> Scala Versionversion 2.10.5
>>> Spark Properties
>>>
>>> NameValue
>>> spark.app.idapp-20160704121044-0003
>>> spark.app.nameDemoApp
>>> spark.driver.extraClassPath/home/sparkuser/sqljdbc4.jar
>>> spark.driver.host10.2.0.4
>>> spark.driver.memory4g
>>> spark.driver.port59493
>>> spark.executor.extraClassPath/usr/local/spark-1.6.1/sqljdbc4.jar
>>> spark.executor.iddriver
>>> spark.executor.memory12g
>>> spark.externalBlockStore.folderName
>>>  spark-5630dd34-4267-462e-882e-b382832bb500
>>> spark.jarsfile:/home/sparkuser/SparkPOC.jar
>>> spark.masterspark://master.clust:7077
>>> spark.scheduler.modeFIFO
>>> spark.submit.deployModeclient
>>> System Properties
>>>
>>> NameValue
>>> SPARK_SUBMITtrue
>>> awt.toolkitsun.awt.X11.XToolkit
>>> file.encodingUTF-8
>>> file.encoding.pkgsun.io
>>> file.separator/
>>> java.awt.graphicsenvsun.awt.X11GraphicsEnvironment
>>> java.awt.printerjobsun.print.PSPrinterJob
>>> java.class.version51.0
>>> java.endorsed.dirs
>>>  /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/endorsed
>>> java.ext.dirs
>>>  
>>> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/ext:/usr/java/packages/lib/ext
>>> java.home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
>>> java.io.tmpdir/tmp
>>> java.library.path
>>>  /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>>> java.runtime.nameOpenJDK Runtime Environment
>>> java.runtime.version1.7.0_65-mockbuild_2014_07_16_06_06-b00
>>> java.specification.nameJava Platform API Specification
>>> java.specification.vendorOracle Corporation
>>> java.specification.version1.7
>>> java.vendorOracle Corporation
>>> java.vendor.urlhttp://java.oracle.com/
>>> java.vendor.url.bughttp://bugreport.sun.com/bugreport/
>>> java.version1.7.0_65
>>> java.vm.infomixed mode
>>> java.vm.nameOpenJDK 64-Bit Server VM
>>> java.vm.specification.nameJava Virtual Machine Specification
>>> java.vm.specification.vendorOracle Corporation
>>> java.vm.specification.version1.7
>>> java.vm.vendorOracle Corporation
>>> java.vm.version24.65-b04
>>> line.separator
>>> os.archamd64
>>> os.nameLinux
>>> os.version2.6.32-431.29.2.el6.x86_64
>>> path.separator:
>>> sun.arch.data.model64
>>> sun.boot.class.path
>>>  
>>> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/resources.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/rt.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jsse.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jce.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/charsets.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/rhino.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jfr.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/classes
>>> sun.boot.library.path
>>>  /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/amd64
>>> sun.cpu.endianlittle
>>> sun.cpu.isalist
>>> 

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Mich Talebzadeh
Hi Jakub,

Any reason why you are running in standalone mode, given that you are
familiar with YARN?

In theory your settings are correct. I checked your environment tab
settings and they look correct.

I assume you have checked this link

http://spark.apache.org/docs/latest/spark-standalone.html

BTW is this issue confined to ML or any other Spark application exhibits
the same behaviour in standalone mode?


HTH






Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 5 July 2016 at 11:17, Jacek Laskowski  wrote:

> Hi Jakub,
>
> You're correct - spark.masterspark://master.clust:7077 - proves your
> point. You're running Spark Standalone that was set in
> conf/spark-defaults.conf perhaps.
>
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> On Tue, Jul 5, 2016 at 12:04 PM, Jakub Stransky 
> wrote:
>
>> Hello,
>>
>> I am convinced that we are not running in local mode:
>>
>> Runtime Information
>>
>> NameValue
>> Java Home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
>> Java Version1.7.0_65 (Oracle Corporation)
>> Scala Versionversion 2.10.5
>> Spark Properties
>>
>> NameValue
>> spark.app.idapp-20160704121044-0003
>> spark.app.nameDemoApp
>> spark.driver.extraClassPath/home/sparkuser/sqljdbc4.jar
>> spark.driver.host10.2.0.4
>> spark.driver.memory4g
>> spark.driver.port59493
>> spark.executor.extraClassPath/usr/local/spark-1.6.1/sqljdbc4.jar
>> spark.executor.iddriver
>> spark.executor.memory12g
>> spark.externalBlockStore.folderName
>>  spark-5630dd34-4267-462e-882e-b382832bb500
>> spark.jarsfile:/home/sparkuser/SparkPOC.jar
>> spark.masterspark://master.clust:7077
>> spark.scheduler.modeFIFO
>> spark.submit.deployModeclient
>> System Properties
>>
>> NameValue
>> SPARK_SUBMITtrue
>> awt.toolkitsun.awt.X11.XToolkit
>> file.encodingUTF-8
>> file.encoding.pkgsun.io
>> file.separator/
>> java.awt.graphicsenvsun.awt.X11GraphicsEnvironment
>> java.awt.printerjobsun.print.PSPrinterJob
>> java.class.version51.0
>> java.endorsed.dirs
>>  /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/endorsed
>> java.ext.dirs
>>  
>> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/ext:/usr/java/packages/lib/ext
>> java.home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
>> java.io.tmpdir/tmp
>> java.library.path
>>  /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>> java.runtime.nameOpenJDK Runtime Environment
>> java.runtime.version1.7.0_65-mockbuild_2014_07_16_06_06-b00
>> java.specification.nameJava Platform API Specification
>> java.specification.vendorOracle Corporation
>> java.specification.version1.7
>> java.vendorOracle Corporation
>> java.vendor.urlhttp://java.oracle.com/
>> java.vendor.url.bughttp://bugreport.sun.com/bugreport/
>> java.version1.7.0_65
>> java.vm.infomixed mode
>> java.vm.nameOpenJDK 64-Bit Server VM
>> java.vm.specification.nameJava Virtual Machine Specification
>> java.vm.specification.vendorOracle Corporation
>> java.vm.specification.version1.7
>> java.vm.vendorOracle Corporation
>> java.vm.version24.65-b04
>> line.separator
>> os.archamd64
>> os.nameLinux
>> os.version2.6.32-431.29.2.el6.x86_64
>> path.separator:
>> sun.arch.data.model64
>> sun.boot.class.path
>>  
>> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/resources.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/rt.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jsse.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jce.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/charsets.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/rhino.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jfr.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/classes
>> sun.boot.library.path
>>  /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/amd64
>> sun.cpu.endianlittle
>> sun.cpu.isalist
>> sun.io.unicode.encodingUnicodeLittle
>> sun.java.commandorg.apache.spark.deploy.SparkSubmit --conf
>> spark.driver.extraClassPath=/home/sparkuser/sqljdbc4.jar --class  

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Jakub Stransky
So now that we have clarified that everything is submitted in standalone cluster mode,
what remains is that the application (an ML pipeline) doesn't take advantage of
the full cluster's power but essentially runs just on the master node until its
resources are exhausted. Why doesn't training an ML decision tree scale to the
rest of the cluster?

On 5 July 2016 at 12:17, Jacek Laskowski  wrote:

> Hi Jakub,
>
> You're correct - spark.masterspark://master.clust:7077 - proves your
> point. You're running Spark Standalone that was set in
> conf/spark-defaults.conf perhaps.
>
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> On Tue, Jul 5, 2016 at 12:04 PM, Jakub Stransky 
> wrote:
>
>> Hello,
>>
>> I am convinced that we are not running in local mode:
>>
>> Runtime Information
>>
>> NameValue
>> Java Home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
>> Java Version1.7.0_65 (Oracle Corporation)
>> Scala Versionversion 2.10.5
>> Spark Properties
>>
>> NameValue
>> spark.app.idapp-20160704121044-0003
>> spark.app.nameDemoApp
>> spark.driver.extraClassPath/home/sparkuser/sqljdbc4.jar
>> spark.driver.host10.2.0.4
>> spark.driver.memory4g
>> spark.driver.port59493
>> spark.executor.extraClassPath/usr/local/spark-1.6.1/sqljdbc4.jar
>> spark.executor.iddriver
>> spark.executor.memory12g
>> spark.externalBlockStore.folderName
>>  spark-5630dd34-4267-462e-882e-b382832bb500
>> spark.jarsfile:/home/sparkuser/SparkPOC.jar
>> spark.masterspark://master.clust:7077
>> spark.scheduler.modeFIFO
>> spark.submit.deployModeclient
>> System Properties
>>
>> NameValue
>> SPARK_SUBMITtrue
>> awt.toolkitsun.awt.X11.XToolkit
>> file.encodingUTF-8
>> file.encoding.pkgsun.io
>> file.separator/
>> java.awt.graphicsenvsun.awt.X11GraphicsEnvironment
>> java.awt.printerjobsun.print.PSPrinterJob
>> java.class.version51.0
>> java.endorsed.dirs
>>  /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/endorsed
>> java.ext.dirs
>>  
>> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/ext:/usr/java/packages/lib/ext
>> java.home/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
>> java.io.tmpdir/tmp
>> java.library.path
>>  /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>> java.runtime.nameOpenJDK Runtime Environment
>> java.runtime.version1.7.0_65-mockbuild_2014_07_16_06_06-b00
>> java.specification.nameJava Platform API Specification
>> java.specification.vendorOracle Corporation
>> java.specification.version1.7
>> java.vendorOracle Corporation
>> java.vendor.urlhttp://java.oracle.com/
>> java.vendor.url.bughttp://bugreport.sun.com/bugreport/
>> java.version1.7.0_65
>> java.vm.infomixed mode
>> java.vm.nameOpenJDK 64-Bit Server VM
>> java.vm.specification.nameJava Virtual Machine Specification
>> java.vm.specification.vendorOracle Corporation
>> java.vm.specification.version1.7
>> java.vm.vendorOracle Corporation
>> java.vm.version24.65-b04
>> line.separator
>> os.archamd64
>> os.nameLinux
>> os.version2.6.32-431.29.2.el6.x86_64
>> path.separator:
>> sun.arch.data.model64
>> sun.boot.class.path
>>  
>> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/resources.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/rt.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jsse.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jce.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/charsets.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/rhino.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jfr.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/classes
>> sun.boot.library.path
>>  /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/amd64
>> sun.cpu.endianlittle
>> sun.cpu.isalist
>> sun.io.unicode.encodingUnicodeLittle
>> sun.java.commandorg.apache.spark.deploy.SparkSubmit --conf
>> spark.driver.extraClassPath=/home/sparkuser/sqljdbc4.jar --class  --class
>> DemoApp SparkPOC.jar 10 4.3
>> sun.java.launcherSUN_STANDARD
>> sun.jnu.encodingUTF-8
>> sun.management.compilerHotSpot 64-Bit Tiered Compilers
>> sun.nio.ch.bugLevel
>> sun.os.patch.levelunknown
>> user.countryUS
>> user.dir/home/sparkuser
>> user.home/home/sparkuser
>> user.languageen
>> user.namesparkuser
>> user.timezoneEtc/UTC
>> Classpath Entries
>>
>> ResourceSource
>> /home/sparkuser/sqljdbc4.jarSystem Classpath
>> /usr/local/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.1-hadoop2.2.0.jar
>>System Classpath
>> /usr/local/spark-1.6.1/conf/System Classpath
>> 

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
If you're talking about limiting the number of messages per batch to
try and keep from exceeding batch time, see

http://spark.apache.org/docs/latest/configuration.html

look for backpressure and maxRatePerPartition


But if you're only seeing zeros after your job runs for a minute, it
sounds like something else is wrong.


On Tue, Jul 5, 2016 at 10:02 AM, rss rss  wrote:
> Hello,
>
>   I'm trying to organize processing of messages from Kafka. And there is a
> typical case when a number of messages in kafka's queue is more then Spark
> app's possibilities to process. But I need a strong time limit to prepare
> result for at least for a part of data.
>
> Code example:
>
> SparkConf sparkConf = new SparkConf()
> .setAppName("Spark")
> .setMaster("local");
>
> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
> Milliseconds.apply(5000));
>
> jssc.checkpoint("/tmp/spark_checkpoint");
>
> Set topicMap = new
> HashSet<>(Arrays.asList(topicList.split(",")));
> Map kafkaParams = new HashMap() {
> {
> put("metadata.broker.list", bootstrapServers);
> put("auto.offset.reset", "smallest");
> }
> };
>
> JavaPairInputDStream messages =
> KafkaUtils.createDirectStream(jssc,
> String.class,
> String.class,
> StringDecoder.class,
> StringDecoder.class,
> kafkaParams,
> topicMap);
>
> messages.countByWindow(Seconds.apply(10), Milliseconds.apply(5000))
> .map(x -> {System.out.println(x); return x;})
> .dstream().saveAsTextFiles("/tmp/spark", "spark-streaming");
>
>
>   I need to see a result of window operation each 10 seconds (this is only
> simplest example). But really with my test data ~10M messages I have first
> result a minute after and further I see only zeros. Is a way to limit
> processing time to guarantee a response in specified time like Apache
> Flink's triggers?
>
> Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark streaming. Strict discretizing by time

2016-07-05 Thread rss rss
Hello,

  I'm trying to organize the processing of messages from Kafka. There is a
typical case where the number of messages in Kafka's queue is more than the
Spark app is able to process. But I need a strict time limit on preparing a
result for at least a part of the data.

Code example:

SparkConf sparkConf = new SparkConf()
.setAppName("Spark")
.setMaster("local");

JavaStreamingContext jssc = new
JavaStreamingContext(sparkConf, Milliseconds.apply(5000));

jssc.checkpoint("/tmp/spark_checkpoint");

Set<String> topicMap = new HashSet<>(Arrays.asList(topicList.split(",")));
Map<String, String> kafkaParams = new HashMap<String, String>() {
    {
        put("metadata.broker.list", bootstrapServers);
        put("auto.offset.reset", "smallest");
    }
};

JavaPairInputDStream<String, String> messages =
        KafkaUtils.createDirectStream(jssc,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                kafkaParams,
                topicMap);

messages.countByWindow(Seconds.apply(10), Milliseconds.apply(5000))
        .map(x -> {System.out.println(x); return x;})
        .dstream().saveAsTextFiles("/tmp/spark", "spark-streaming");


  I need to see the result of the window operation every 10 seconds (this is only
the simplest example). But with my test data of ~10M messages, I get the first
result a minute later, and after that I see only zeros. Is there a way to limit
processing time so as to guarantee a response within a specified time, like Apache
Flink's triggers?

Thanks.


Re: Read Kafka topic in a Spark batch job

2016-07-05 Thread Cody Koeninger
If it's a batch job, don't use a stream.

You have to store the offsets reliably somewhere regardless.  So it sounds
like your only issue is with identifying offsets per partition?  Look at
KafkaCluster.scala, methods getEarliestLeaderOffsets /
getLatestLeaderOffsets.
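
As a rough sketch (Spark 1.6 with the Kafka 0.8 integration; the broker, topic and
offsets below are made up), a batch read over explicit offset ranges looks
something like this:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val offsetRanges = Array(
      OffsetRange("events", 0, 0L, 1000L),  // topic, partition, fromOffset, untilOffset
      OffsetRange("events", 1, 0L, 1000L)
    )
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)
    // after processing, persist each range's untilOffset (e.g. in ZooKeeper or HDFS)
    // so the next batch run knows where to start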

On Tue, Jul 5, 2016 at 7:40 AM, Bruckwald Tamás  wrote:

> Thanks for your answer. Unfortunately I'm bound to Kafka 0.8.2.1.
> --Bruckwald
>
> nihed mbarek  wrote:
>
> Hi,
>
> Are you using a new version of kafka  ? if yes
> since 0.9 auto.offset.reset parameter take :
>
>- earliest: automatically reset the offset to the earliest offset
>- latest: automatically reset the offset to the latest offset
>- none: throw exception to the consumer if no previous offset is found
>for the consumer's group
>- anything else: throw exception to the consumer.
>
> https://kafka.apache.org/documentation.html
>
>
> Regards,
>
> On Tue, Jul 5, 2016 at 2:15 PM, Bruckwald Tamás <
> tamas.bruckw...@freemail.hu> wrote:
>>
>> Hello,
>>
>> I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic.
>> For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD
>> however, I need to set the offsets for all the partitions and also need to
>> store them somewhere (ZK? HDFS?) to know from where to start the next batch
>> job.
>> What is the right approach to read from Kafka in a batch job?
>>
>> I'm also thinking about writing a streaming job instead, which reads from
>> auto.offset.reset=smallest and saves the checkpoint to HDFS and then in the
>> next run it starts from that.
>> But in this case how can I just fetch once and stop streaming after the
>> first batch?
>>
>> I posted this question on StackOverflow recently (
>> http://stackoverflow.com/q/38026627/4020050) but got no answer there, so
>> I'd ask here as well, hoping that I get some ideas on how to resolve this
>> issue.
>>
>> Thanks - Bruckwald
>>
>
>
>
> --
>
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
>
> 
>
>
>


Re: Standalone mode resource allocation questions

2016-07-05 Thread Jacek Laskowski
On Tue, Jul 5, 2016 at 4:18 PM, Jakub Stransky  wrote:

> 1) Is it possible to configure multiple executors per worker machine?

Yes.

> Do I understand it correctly that I specify SPARK_WORKER_MEMORY and
> SPARK_WORKER_CORES which essentially describes available resources to spark
> at that machine. And the number of executors actually run depends on
> spark.executor.memory setting and number of run executors is
> SPARK_WORKER_MEMORY/ spark.executor.memory

Sort of. Use the following conf/spark-env.sh to have 2 workers:

SPARK_WORKER_CORES=2
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_MEMORY=2g

> 2) How do I limit resource at the application submission time?
> I can change executor-memory when submitting application but that specifies
> just size of the executor right? That actually allows dynamically change
> number of executors run on worker machine. Is there a way how to limit the
> number of executors per application or so? For example because of more
> application running on cluster..

Don't think Standalone supports it.

Jacek

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Standalone mode resource allocation questions

2016-07-05 Thread Jakub Stransky
Hello,

I went through the Spark documentation and several posts from Cloudera etc., and
as my background is heavily in Hadoop/YARN there is still a little confusion.
Could someone more experienced clarify, please?

What I am trying to achieve:
- Running cluster in standalone mode version 1.6.1

Questions - mainly about resource management on standalone mode
1) Is it possible to configure multiple executors per worker machine?
Do I understand it correctly that I specify SPARK_WORKER_MEMORY and
SPARK_WORKER_CORES, which essentially describe the resources available to Spark
on that machine? And the number of executors actually run depends on the
spark.executor.memory setting, i.e. the number of executors is
SPARK_WORKER_MEMORY / spark.executor.memory?

2) How do I limit resource at the application submission time?
I can change executor-memory when submitting an application, but that specifies
just the size of each executor, right? That indirectly changes the number of
executors run on a worker machine. Is there a way to limit the number of
executors per application or something similar? For example, because more
applications are running on the cluster.

Thx


Having issues of passing properties to Spark in 1.5 in comparison to 1.2

2016-07-05 Thread Nkechi Achara
After using Spark 1.2 for quite a long time, I have realised that you can
no longer pass Spark configuration to the driver via --conf on the command
line (or, in my case, a shell script).

I am thinking about using system properties and picking the config up using
the following bit of code:

def getConfigOption(conf: SparkConf, name: String): Option[String] =
  conf.getOption(name) orElse sys.props.get(name)

How do I pass a config.file option and a string version of the date, specified
as a start time, to a spark-submit command?

I have attempted using the following in my start up shell script:
other Config options \
--conf
"spark.executor.extraJavaOptions=-Dconfig.file=../conf/mifid.conf
-DstartTime=2016-06-04 00:00:00" \

but this fails because the space in the date splits the command up.

Any idea how to do this successfully, or has anyone got any advice on this
one?

Thanks

K


Re: Spark Dataframe validating column names

2016-07-05 Thread Jacek Laskowski
Hi,

What do you think of using df.columns to know the column names and
process appropriately or df.schema?
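
For example, something along these lines (a minimal sketch; renaming with
underscores is just one possible policy, logging and dropping the whole batch
is another):

import org.apache.spark.sql.DataFrame

val invalid = ".*[ ,;{}()\\n\\t=].*"

// columns that CatalystSchemaConverter would reject for Parquet
def badColumns(df: DataFrame): Seq[String] = df.columns.filter(_.matches(invalid))

// rename offending columns before writing (or skip the batch if badColumns is non-empty)
def sanitize(df: DataFrame): DataFrame =
  df.select(df.columns.map(c => df(c).alias(c.replaceAll("[ ,;{}()\\n\\t=]", "_"))): _*)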

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jul 5, 2016 at 7:02 AM, Scott W  wrote:
> Hello,
>
> I'm processing events using Dataframes converted from a stream of JSON
> events (Spark streaming) which eventually gets written out as as Parquet
> format. There are different JSON events coming in so we use schema inference
> feature of Spark SQL
>
> The problem is some of the JSON events contains spaces in the keys which I
> want to log and filter/drop such events from the data frame before
> converting it to Parquet because ,;{}()\n\t= are considered special
> characters in Parquet schema (CatalystSchemaConverter) as listed in [1]
> below and thus should not be allowed in the column names.
>
> How can I do such validations in Dataframe on the column names and drop such
> an event altogether without erroring out the Spark Streaming job?
>
> [1] Spark's CatalystSchemaConverter
>
> def checkFieldName(name: String): Unit = {
> // ,;{}()\n\t= and space are special characters in Parquet schema
> checkConversionRequirement(
>   !name.matches(".*[ ,;{}()\n\t=].*"),
>   s"""Attribute name "$name" contains invalid character(s) among "
> ,;{}()\\n\\t=".
>  |Please use alias to rename it.
>""".stripMargin.split("\n").mkString(" ").trim)
>   }

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Read Kafka topic in a Spark batch job

2016-07-05 Thread Bruckwald Tamás
Thanks for your answer. Unfortunately I'm bound to Kafka 0.8.2.1. --Bruckwald
nihed mbarek  wrote:
>Hi, Are you using a new version of kafka? If yes, since 0.9 the auto.offset.reset
>parameter takes: earliest: automatically reset the offset to the earliest
>offset; latest: automatically reset the offset to the latest offset; none: throw
>exception to the consumer if no previous offset is found for the
>consumer's group; anything else: throw exception to the
>consumer. https://kafka.apache.org/documentation.html  Regards, On Tue, Jul 5,
>2016 at 2:15 PM, Bruckwald Tamás  wrote:

>>Hello, I'm writing a Spark (v1.6.0) batch job which reads from a Kafka
>>topic.
>>For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD
>>however, I need to set the offsets for all the partitions and also need to
>>store them somewhere (ZK? HDFS?) to know from where to start the next batch
>>job. What is the right approach to read from Kafka in a batch job? I'm
>>also thinking about writing a streaming job instead, which reads from
>>auto.offset.reset=smallest and saves the checkpoint to HDFS and then in the
>>next run it starts from that. But in this case how can I just fetch once and
>>stop streaming after the first batch? I posted this question on StackOverflow
>>recently (http://stackoverflow.com/q/38026627/4020050) but got no answer
>>there, so I'd ask here as well, hoping that I get some ideas on how to
>>resolve this issue.
>> Thanks - Bruckwald

>
>
>
>--
>MBAREK Med Nihed,
>Fedora Ambassador, TUNISIA, Northern Africa
>http://www.nihed.com
>
>
> 


Re: Graphframe Error

2016-07-05 Thread Arun Patel
Thanks Yanbo and Felix.

I tried these commands on CDH Quickstart VM and also on "Spark 1.6
pre-built for Hadoop" version.  I am still not able to get it working.  Not
sure what I am missing.  Attaching the logs.




On Mon, Jul 4, 2016 at 5:33 AM, Felix Cheung 
wrote:

> It looks like either the extracted Python code is corrupted or there is a
> mismatch Python version. Are you using Python 3?
>
>
> stackoverflow.com/questions/514371/whats-the-bad-magic-number-error
>
>
>
>
>
> On Mon, Jul 4, 2016 at 1:37 AM -0700, "Yanbo Liang" 
> wrote:
>
> Hi Arun,
>
> The command
>
> bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
>
> will automatically load the required graphframes jar file from maven
> repository, it was not affected by the location where the jar file was
> placed. Your examples works well in my laptop.
>
> Or you can use try with
>
> bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar
>
> to launch PySpark with graphframes enabled. You should set "--py-files"
> and "--jars" options with the directory where you saved graphframes.jar.
>
> Thanks
> Yanbo
>
>
> 2016-07-03 15:48 GMT-07:00 Arun Patel :
>
>> I started my pyspark shell with command  (I am using spark 1.6).
>>
>> bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
>>
>> I have copied
>> http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar
>> to the lib directory of Spark as well.
>>
>> I was getting below error
>>
>> >>> from graphframes import *
>> Traceback (most recent call last):
>>   File "", line 1, in 
>> zipimport.ZipImportError: can't find module 'graphframes'
>> >>>
>>
>> So, as per suggestions from similar questions, I have extracted the
>> graphframes python directory and copied to the local directory where I am
>> running pyspark.
>>
>> >>> from graphframes import *
>>
>> But, not able to create the GraphFrame
>>
>> >>> g = GraphFrame(v, e)
>> Traceback (most recent call last):
>>   File "", line 1, in 
>> NameError: name 'GraphFrame' is not defined
>>
>> Also, I am getting below error.
>> >>> from graphframes.examples import Graphs
>> Traceback (most recent call last):
>>   File "", line 1, in 
>> ImportError: Bad magic number in graphframes/examples.pyc
>>
>> Any help will be highly appreciated.
>>
>> - Arun
>>
>
>


Graphframes-packages.log
Description: Binary data


Graphframes-pyfiles.log
Description: Binary data

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Read Kafka topic in a Spark batch job

2016-07-05 Thread nihed mbarek
Hi,

Are you using a new version of kafka  ? if yes
since 0.9 auto.offset.reset parameter take :

   - earliest: automatically reset the offset to the earliest offset
   - latest: automatically reset the offset to the latest offset
   - none: throw exception to the consumer if no previous offset is found
   for the consumer's group
   - anything else: throw exception to the consumer.

https://kafka.apache.org/documentation.html


Regards,

On Tue, Jul 5, 2016 at 2:15 PM, Bruckwald Tamás  wrote:

> Hello,
>
> I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic.
> For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD
> however, I need to set the offsets for all the partitions and also need to
> store them somewhere (ZK? HDFS?) to know from where to start the next batch
> job.
> What is the right approach to read from Kafka in a batch job?
>
> I'm also thinking about writing a streaming job instead, which reads from
> auto.offset.reset=smallest and saves the checkpoint to HDFS and then in the
> next run it starts from that.
> But in this case how can I just fetch once and stop streaming after the
> first batch?
>
> I posted this question on StackOverflow recently (
> http://stackoverflow.com/q/38026627/4020050) but got no answer there, so
> I'd ask here as well, hoping that I get some ideas on how to resolve this
> issue.
>
> Thanks - Bruckwald
>



-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Read Kafka topic in a Spark batch job

2016-07-05 Thread Bruckwald Tamás
Hello, I'm writing a Spark (v1.6.0) batch job which reads from a Kafka
topic.
For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD
however, I need to set the offsets for all the partitions and also need to
store them somewhere (ZK? HDFS?) to know from where to start the next batch
job. What is the right approach to read from Kafka in a batch job? I'm also
thinking about writing a streaming job instead, which reads from
auto.offset.reset=smallest and saves the checkpoint to HDFS and then in the
next run it starts from that. But in this case how can I just fetch once and
stop streaming after the first batch? I posted this question on StackOverflow
recently (http://stackoverflow.com/q/38026627/4020050) but got no answer there,
so I'd ask here as well, hoping that I get some ideas on how to resolve
this issue.
 Thanks - Bruckwald

Spark MLlib: network intensive algorithms

2016-07-05 Thread mshiryae
Hi,

I have a question wrt ML algorithms.
What are the most network intensive algorithms in Spark MLlib?

I have already looked at ALS (as pointed here:
https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
ALS is pretty communication and computation intensive but in the latest
Spark ALS is pretty optimized in terms of communications).

As far as I understand, ML algorithms do many computations on each node and
periodically perform small exchanges between nodes.

Maybe you know of examples where an algorithm is network-bound or where network
exchanges consume a noticeable share of the training time?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLlib-network-intensive-algorithms-tp27287.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



StreamingKmeans Spark doesn't work at all

2016-07-05 Thread Biplob Biswas

Hi, 

I implemented the StreamingKMeans example provided on the Spark website, but
in Java.
The full implementation is here, 

http://pastebin.com/CJQfWNvk

But I am not getting anything in the output except occasional timestamps
like one below: 

--- 
Time: 1466176935000 ms 
--- 

Also, i have 2 directories: 
"D:\spark\streaming example\Data Sets\training" 
"D:\spark\streaming example\Data Sets\test" 

and inside these directories I have one file each, "samplegpsdata_train.txt"
and "samplegpsdata_test.txt", with the training data having 500 data points and
the test data 60 data points.

I am very new to the spark systems and any help is highly appreciated. 

//---//

Now, I have also tried using the Scala implementation available here:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala


and even had the training and test file provided in the format specified in
that file as follows:

 * The rows of the training text files must be vector data in the form
 * `[x1,x2,x3,...,xn]`
 * Where n is the number of dimensions.
 *
 * The rows of the test text files must be labeled data in the form
 * `(y,[x1,x2,x3,...,xn])`
 * Where y is some identifier. n must be the same for train and test.
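
For instance (made-up values, just to illustrate the expected shape):

 [1.0,2.5,3.1]          <- a training row with n = 3 dimensions
 (42,[1.0,2.5,3.1])     <- a test row; 42 is an arbitrary label/identifier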


But I still get no output on my eclipse window ... just the Time! 

Can anyone seriously help me with this? 

Thank you so much 
Biplob Biswas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/StreamingKmeans-Spark-doesn-t-work-at-all-tp27286.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Dataframe sort

2016-07-05 Thread tan shai
Hi,

I need to sort a DataFrame and retrieve the bounds of each partition.
DataFrame.sort() uses range partitioning in the physical plan.

I need to retrieve the partition bounds.

Many thanks for your help.
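
One way to inspect the bounds once the sort has run (a minimal sketch, assuming
an integer sort column named "key"; both the column name and its type are
placeholders):

val sorted = df.sort("key")
val bounds = sorted.rdd.mapPartitionsWithIndex { (idx, rows) =>
  if (rows.isEmpty) Iterator.empty
  else {
    val first = rows.next()
    var last = first
    while (rows.hasNext) last = rows.next()
    Iterator((idx, first.getAs[Int]("key"), last.getAs[Int]("key")))
  }
}.collect()   // Array of (partitionIndex, lowerBound, upperBound)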


Re: Deploying ML Pipeline Model

2016-07-05 Thread Nick Pentreath
It all depends on your latency requirements and volume. 100s of queries per
minute, with an acceptable latency of up to a few seconds? Yes, you could
use Spark for serving, especially if you're smart about caching results
(and I don't mean just Spark caching, but caching recommendation results
for example similar items etc).

However for many serving use cases using a Spark cluster is too much
overhead. Bear in mind real-world serving of many models (recommendations,
ad-serving, fraud etc) is one component of a complex workflow (e.g. one
page request in ad tech cases involves tens of requests and hops between
various ad servers and exchanges). That is why often the practical latency
bounds are < 100ms (or way, way tighter for ad serving for example).


On Fri, 1 Jul 2016 at 21:59 Saurabh Sardeshpande 
wrote:

> Hi Nick,
>
> Thanks for the answer. Do you think an implementation like the one in this
> article is infeasible in production for say, hundreds of queries per
> minute?
> https://www.codementor.io/spark/tutorial/building-a-web-service-with-apache-spark-flask-example-app-part2.
> The article uses Flask to define routes and Spark for evaluating requests.
>
> Regards,
> Saurabh
>
>
>
>
>
>
> On Fri, Jul 1, 2016 at 10:47 AM, Nick Pentreath 
> wrote:
>
>> Generally there are 2 ways to use a trained pipeline model - (offline)
>> batch scoring, and real-time online scoring.
>>
>> For batch (or even "mini-batch" e.g. on Spark streaming data), then yes
>> certainly loading the model back in Spark and feeding new data through the
>> pipeline for prediction works just fine, and this is essentially what is
>> supported in 1.6 (and more or less full coverage in 2.0). For large batch
>> cases this can be quite efficient.
>>
>> However, usually for real-time use cases, the latency required is fairly
>> low - of the order of a few ms to a few 100ms for a request (some examples
>> include recommendations, ad-serving, fraud detection etc).
>>
>> In these cases, using Spark has 2 issues: (1) latency for prediction on
>> the pipeline, which is based on DataFrames and therefore distributed
>> execution, is usually fairly high "per request"; (2) this requires pulling
>> in all of Spark for your real-time serving layer (or running a full Spark
>> cluster), which is usually way too much overkill - all you really need for
>> serving is a bit of linear algebra and some basic transformations.
>>
>> So for now, unfortunately there is not much in the way of options for
>> exporting your pipelines and serving them outside of Spark - the
>> JPMML-based project mentioned on this thread is one option. The other
>> option at this point is to write your own export functionality and your own
>> serving layer.
>>
>> There is (very initial) movement towards improving the local serving
>> possibilities (see https://issues.apache.org/jira/browse/SPARK-13944 which
>> was the "first step" in this process).
>>
>> On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski  wrote:
>>
>>> Hi Rishabh,
>>>
>>> I've just today had similar conversation about how to do a ML Pipeline
>>> deployment and couldn't really answer this question and more because I
>>> don't really understand the use case.
>>>
>>> What would you expect from ML Pipeline model deployment? You can save
>>> your model to a file by model.write.overwrite.save("model_v1").
>>>
>>> model_v1
>>> |-- metadata
>>> |   |-- _SUCCESS
>>> |   `-- part-0
>>> `-- stages
>>> |-- 0_regexTok_b4265099cc1c
>>> |   `-- metadata
>>> |   |-- _SUCCESS
>>> |   `-- part-0
>>> |-- 1_hashingTF_8de997cf54ba
>>> |   `-- metadata
>>> |   |-- _SUCCESS
>>> |   `-- part-0
>>> `-- 2_linReg_3942a71d2c0e
>>> |-- data
>>> |   |-- _SUCCESS
>>> |   |-- _common_metadata
>>> |   |-- _metadata
>>> |   `--
>>> part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
>>> `-- metadata
>>> |-- _SUCCESS
>>> `-- part-0
>>>
>>> 9 directories, 12 files
>>>
>>> What would you like to have outside SparkContext? What's wrong with
>>> using Spark? Just curious hoping to understand the use case better.
>>> Thanks.
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>> On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj 
>>> wrote:
>>> > Hi All,
>>> >
>>> > I am looking for ways to deploy a ML Pipeline model in production .
>>> > Spark has already proved to be a one of the best framework for model
>>> > training and creation, but once the ml pipeline model is ready how can
>>> I
>>> > deploy it outside spark context ?
>>> > MLlib model has toPMML method but today Pipeline model can not be
>>> saved to
>>> > PMML. There are some frameworks like MLeap which are 

Re: Deploying ML Pipeline Model

2016-07-05 Thread Nick Pentreath
Sean is correct - we now use jpmml-model (which is actually BSD 3-clause,
where old jpmml was A2L, but either work)

On Fri, 1 Jul 2016 at 21:40 Sean Owen  wrote:

> (The more core JPMML libs are Apache 2; OpenScoring is AGPL. We use
> JPMML in Spark and couldn't otherwise because the Affero license is
> not Apache compatible.)
>
> On Fri, Jul 1, 2016 at 8:16 PM, Nick Pentreath 
> wrote:
> > I believe open-scoring is one of the well-known PMML serving frameworks
> in
> > Java land (https://github.com/jpmml/openscoring). One can also use the
> raw
> > https://github.com/jpmml/jpmml-evaluator for embedding in apps.
> >
> > (Note the license on both of these is AGPL - the older version of JPMML
> used
> > to be Apache2 if I recall correctly).
> >
>


pyspark: dataframe.take is slow

2016-07-05 Thread immerrr again
Hi all!

I'm having a strange issue with pyspark 1.6.1. I have a dataframe,

df = sqlContext.read.parquet('/path/to/data')

whose "df.take(10)" is really slow, apparently scanning the whole
dataset to take the first ten rows. "df.first()" works fast, as does
"df.rdd.take(10)".

I have found https://issues.apache.org/jira/browse/SPARK-10731, which
should have fixed it in 1.6.0, but it has not. What am I doing wrong
here, and how can I fix this?

Cheers,
immerrr

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Enforcing shuffle hash join

2016-07-05 Thread ??????
You can try setting "spark.shuffle.manager" to "hash".
This is the meaning of the parameter:
Implementation to use for shuffling data. There are two implementations
available: sort and hash. Sort-based shuffle is more memory-efficient and is the
default option starting in 1.2.
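
A minimal way to try that setting (a sketch; it can equally be passed as
--conf spark.shuffle.manager=hash to spark-submit):

val conf = new SparkConf().set("spark.shuffle.manager", "hash")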





Re: java.io.FileNotFoundException

2016-07-05 Thread Jacek Laskowski
On Tue, Jul 5, 2016 at 2:16 AM, kishore kumar  wrote:

> 2016-07-04 05:11:53,972 [dispatcher-event-loop-0] ERROR
> org.apache.spark.scheduler.LiveListenerBus- Dropping SparkListenerEvent
> because no remaining room in event q
> ueue. This likely means one of the SparkListeners is too slow and cannot
> keep up with the rate at which tasks are being started by the scheduler.

This one is worth pursuing... I think that without more logs from the
driver/AM/executors it's hard to find the root cause. Do you
cache/persist? How much? How often? How much pressure is there on the
executors? What about the driver?

Jacek

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-05 Thread Yu Wei
Hi Deng,


Thanks for the help. Actually I need to pay more attention to memory usage.

I found the root cause of my problem. It seems to be in the Spark Streaming
MQTTUtils module.

When I use "localhost" in the brokerUrl, it doesn't work.

After changing it to "127.0.0.1", it works now.
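
For reference, that means a broker URL of roughly this shape (1883 is the
conventional MQTT port and is an assumption here, not taken from the original
post):

MQTTUtils.createStream(jssc, "tcp://127.0.0.1:1883", topic)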


Thanks again,

Jared




From: odeach...@gmail.com  on behalf of Deng Ching-Mallete 

Sent: Tuesday, July 5, 2016 4:03:28 PM
To: Yu Wei
Cc: user@spark.apache.org
Subject: Re: Is that possible to launch spark streaming application on yarn 
with only one machine?

Hi Jared,

You can launch a Spark application even with just a single node in YARN, 
provided that the node has enough resources to run the job.

It might also be good to note that when YARN calculates the memory allocation 
for the driver and the executors, there is an additional memory overhead that 
is added for each executor then it gets rounded up to the nearest GB, IIRC. So 
the 4G driver-memory + 4x2G executor memory do not necessarily translate to a 
total of 12G memory allocation. It would be more than that, so the node would 
need to have more than 12G of memory for the job to execute in YARN. You should 
be able to see something like "No resources available in cluster.." in the 
application master logs in YARN if that is the case.

HTH,
Deng

On Tue, Jul 5, 2016 at 4:31 PM, Yu Wei 
> wrote:

Hi guys,

I set up pseudo hadoop/yarn cluster on my labtop.

I wrote a simple spark streaming program as below to receive messages with 
MQTTUtils.

conf = new SparkConf().setAppName("Monitor");
jssc = new JavaStreamingContext(conf, Durations.seconds(1));
JavaReceiverInputDStream inputDS = MQTTUtils.createStream(jssc, 
brokerUrl, topic);

inputDS.print();
jssc.start();
jssc.awaitTermination()


If I submitted the app with "--master local[2]", it works well.

spark-submit --master local[4] --driver-memory 4g --executor-memory 2g 
--num-executors 4 target/CollAna-1.0-SNAPSHOT.jar

If I submitted with "--master yarn",  no output for "inputDS.print()".

spark-submit --master yarn --deploy-mode cluster --driver-memory 4g 
--executor-memory 2g --num-executors 4 target/CollAna-1.0-SNAPSHOT.jar

Is it possible to launch spark application on yarn with only one single node?


Thanks for your advice.


Jared




How Spark HA works

2016-07-05 Thread Akmal Abbasov

Hi, 
I'm trying to understand how Spark HA works. I'm using Spark 1.6.1 and 
Zookeeper 3.4.6.
I've added the following line to $SPARK_HOME/conf/spark-env.sh:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 -Dspark.deploy.zookeeper.dir=/spark"
It's working so far.
I'd like to set up a link which will always go to the active master UI (I'm
using Spark in standalone mode).
I've checked the znode /spark, and it contains 
[leader_election, master_status]
I'm assuming that the master_status znode will contain the IP address of the
currently active master; is that true? Because in my case this znode isn't
updated after failover.
And how does /spark/leader_election work, given that it doesn't contain any data?
Thank you.

Regards,
Akmal




Re: Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-05 Thread Deng Ching-Mallete
Hi Jared,

You can launch a Spark application even with just a single node in YARN,
provided that the node has enough resources to run the job.

It might also be good to note that when YARN calculates the memory
allocation for the driver and the executors, there is an additional memory
overhead that is added for each executor then it gets rounded up to the
nearest GB, IIRC. So the 4G driver-memory + 4x2G executor memory do not
necessarily translate to a total of 12G memory allocation. It would be more
than that, so the node would need to have more than 12G of memory for the
job to execute in YARN. You should be able to see something like "No
resources available in cluster.." in the application master logs in YARN if
that is the case.
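
To make that concrete (rough numbers only, assuming the default
spark.yarn.executor.memoryOverhead of max(384 MB, 10% of the executor memory)
and a YARN minimum allocation of 1 GB):

driver container:    4096 MB + 384 MB overhead = 4480 MB, rounded up to 5120 MB
executor container:  2048 MB + 384 MB overhead = 2432 MB, rounded up to 3072 MB
1 driver + 4 executors: 5120 + 4 * 3072 = 17408 MB, i.e. about 17 GB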

HTH,
Deng

On Tue, Jul 5, 2016 at 4:31 PM, Yu Wei  wrote:

> Hi guys,
>
> I set up pseudo hadoop/yarn cluster on my labtop.
>
> I wrote a simple spark streaming program as below to receive messages with
> MQTTUtils.
> conf = new SparkConf().setAppName("Monitor");
> jssc = new JavaStreamingContext(conf, Durations.seconds(1));
> JavaReceiverInputDStream inputDS = MQTTUtils.createStream(jssc,
> brokerUrl, topic);
>
> inputDS.print();
> jssc.start();
> jssc.awaitTermination()
>
> If I submitted the app with "--master local[2]", it works well.
>
> spark-submit --master local[4] --driver-memory 4g --executor-memory 2g
> --num-executors 4 target/CollAna-1.0-SNAPSHOT.jar
>
> If I submitted with "--master yarn",  no output for "inputDS.print()".
>
> spark-submit --master yarn --deploy-mode cluster --driver-memory 4g
> --executor-memory 2g --num-executors 4 target/CollAna-1.0-SNAPSHOT.jar
>
> Is it possible to launch spark application on yarn with only one single
> node?
>
>
> Thanks for your advice.
>
>
> Jared
>
>
>


Re: How to Create a Database in Spark SQL

2016-07-05 Thread Mich Talebzadeh
it should work


spark-sql> create database somedb;
OK
Time taken: 2.694 seconds

spark-sql> show databases;
OK
accounts
asehadoop
default
iqhadoop
mytable_db
oraclehadoop
somedb
test
twitterdb
Time taken: 1.277 seconds, Fetched 9 row(s)

spark-sql> use somedb;
OK
Time taken: 0.059 seconds

spark-sql> create table test (col1 int, col2 string);
OK
Time taken: 0.245 seconds

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 5 July 2016 at 07:57, lokeshyadav  wrote:

> Hi
> I am very new to SparkSQL and I have a very basic question: How do I create
> a database or multiple databases in sparkSQL. I am executing the SQL from
> spark-sql CLI. The query like in hive: /create database sample_db/ does not
> work here. I have Hadoop 2.7 and Spark 1.6 installed on my system.
>
> Regards
> Lokesh Yadav
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Create-a-Database-in-Spark-SQL-tp27281.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


How to Create a Database in Spark SQL

2016-07-05 Thread lokeshyadav
Hi
I am very new to SparkSQL and I have a very basic question: How do I create
a database or multiple databases in sparkSQL. I am executing the SQL from
spark-sql CLI. The query like in Hive: create database sample_db does not
work here. I have Hadoop 2.7 and Spark 1.6 installed on my system.

Regards
Lokesh Yadav




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Create-a-Database-in-Spark-SQL-tp27281.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Enforcing shuffle hash join

2016-07-05 Thread Lalitha MV
By setting preferSortMergeJoin to false, it still only picks between
sort merge join and broadcast join, depending on the
autoBroadcastJoinThreshold value; it does not pick shuffle hash join.
I went through SparkStrategies, and it doesn't look like there is a direct,
clean way to enforce it.


On Mon, Jul 4, 2016 at 10:56 PM, Sun Rui  wrote:

> You can try setting the “spark.sql.join.preferSortMergeJoin” conf option to
> false.
>
> For detailed join strategies, take a look at the source code of
> SparkStrategies.scala:
>
> /**
>  * Select the proper physical plan for join based on joining keys and size of 
> logical plan.
>  *
>  * At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where at 
> least some of the
>  * predicates can be evaluated by matching join keys. If found,  Join 
> implementations are chosen
>  * with the following precedence:
>  *
>  * - Broadcast: if one side of the join has an estimated physical size that 
> is smaller than the
>  * user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold
>  * or if that side has an explicit broadcast hint (e.g. the user applied 
> the
>  * [[org.apache.spark.sql.functions.broadcast()]] function to a 
> DataFrame), then that side
>  * of the join will be broadcasted and the other side will be streamed, 
> with no shuffling
>  * performed. If both sides of the join are eligible to be broadcasted 
> then the
>  * - Shuffle hash join: if the average size of a single partition is small 
> enough to build a hash
>  * table.
>  * - Sort merge: if the matching join keys are sortable.
>  *
>  * If there is no joining keys, Join implementations are chosen with the 
> following precedence:
>  * - BroadcastNestedLoopJoin: if one side of the join could be broadcasted
>  * - CartesianProduct: for Inner join
>  * - BroadcastNestedLoopJoin
>  */
>
>
>
> On Jul 5, 2016, at 13:28, Lalitha MV  wrote:
>
> It picks sort merge join, when spark.sql.autoBroadcastJoinThreshold is
> set to -1, or when the size of the small table is more than spark.sql.
> spark.sql.autoBroadcastJoinThreshold.
>
> On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro 
> wrote:
>
>> The join selection can be described in
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92
>> .
>> If you have join keys, you can set -1 at
>> `spark.sql.autoBroadcastJoinThreshold` to disable broadcast joins. Then,
>> hash joins are used in queries.
>>
>> // maropu
>>
>> On Tue, Jul 5, 2016 at 4:23 AM, Lalitha MV  wrote:
>>
>>> Hi maropu,
>>>
>>> Thanks for your reply.
>>>
>>> Would it be possible to write a rule for this, to make it always pick
>>> shuffle hash join, over other join implementations(i.e. sort merge and
>>> broadcast)?
>>>
>>> Is there any documentation demonstrating rule based transformation for
>>> physical plan trees?
>>>
>>> Thanks,
>>> Lalitha
>>>
>>> On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro >> > wrote:
>>>
 Hi,

 No, spark has no hint for the hash join.

 // maropu

 On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV 
 wrote:

> Hi,
>
> In order to force broadcast hash join, we can set
> the spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce
> shuffle hash join in spark sql?
>
>
> Thanks,
> Lalitha
>



 --
 ---
 Takeshi Yamamuro

>>>
>>>
>>>
>>> --
>>> Regards,
>>> Lalitha
>>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
>
> --
> Regards,
> Lalitha
>
>
>


-- 
Regards,
Lalitha


Is that possible to launch spark streaming application on yarn with only one machine?

2016-07-05 Thread Yu Wei
Hi guys,

I set up a pseudo Hadoop/YARN cluster on my laptop.

I wrote a simple spark streaming program as below to receive messages with 
MQTTUtils.

conf = new SparkConf().setAppName("Monitor");
jssc = new JavaStreamingContext(conf, Durations.seconds(1));
JavaReceiverInputDStream<String> inputDS = MQTTUtils.createStream(jssc, 
brokerUrl, topic);

inputDS.print();
jssc.start();
jssc.awaitTermination()


If I submitted the app with "--master local[2]", it works well.

spark-submit --master local[4] --driver-memory 4g --executor-memory 2g 
--num-executors 4 target/CollAna-1.0-SNAPSHOT.jar

If I submitted with "--master yarn",  no output for "inputDS.print()".

spark-submit --master yarn --deploy-mode cluster --driver-memory 4g 
--executor-memory 2g --num-executors 4 target/CollAna-1.0-SNAPSHOT.jar

Is it possible to launch spark application on yarn with only one single node?


Thanks for your advice.


Jared