Re: how the sparksession initialization, set currentDatabase value?

2017-01-10 Thread smartzjp

I think if you want to run Spark SQL from the CLI this configuration will be
fine, but if you want to run it with the distributed query engine, start the
JDBC/ODBC server and point it at the Hive metastore address.

You can reference this description for more detail.
http://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine


-


When Spark reads a Hive table, catalog.currentDatabase is the default database.
How do I set the currentDatabase value when the SparkSession is initialized?




<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
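For what it's worth, a minimal sketch (my own illustration, not from the original
mail) of pointing a SparkSession at that metastore and switching the current
database; the database name "my_db" is a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("current-database-example")
  .config("hive.metastore.uris", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate()

// Select the database used for subsequent unqualified table names.
spark.catalog.setCurrentDatabase("my_db")   // or: spark.sql("USE my_db")
println(spark.catalog.currentDatabase)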






Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-10 Thread Raju Bairishetti
Hello,

   Spark SQL is generating the query plan with all partition information even
if we apply filters on partitions in the query. Due to this, the Spark
driver/Hive metastore hits OOM, as each table has lots of partitions.

We can confirm from hive audit logs that it tries to *fetch all partitions*
from hive metastore.

 2016-12-28 07:18:33,749 INFO  [pool-4-thread-184]: HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub ip=/x.x.x.x
cmd=get_partitions : db= tbl=x


Configured the following parameters in the Spark conf to fix the above
issue (source: a Spark JIRA and a GitHub pull request):

*spark.sql.hive.convertMetastoreParquet   false*
*spark.sql.hive.metastorePartitionPruning   true*


   plan: rdf.explain
   == Physical Plan ==
   HiveTableScan [rejection_reason#626], MetastoreRelation dbname, tablename,
   None, [(year#314 = 2016),(month#315 = 12),(day#316 = 28),(hour#317 = 2),(venture#318 = DEFAULT)]

With these settings, the *get_partitions_by_filter* method is called and only
the required partitions are fetched.
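As a sketch (mine, not from the original mail) of wiring the two settings above
into the session; the database/table and filter columns are taken from the plan
shown earlier and are otherwise placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-pruning-example")
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .enableHiveSupport()
  .getOrCreate()

// With metastore partition pruning on, only the filtered partitions should be
// requested from the metastore (get_partitions_by_filter) instead of all of them.
val rdf = spark.sql(
  """SELECT rejection_reason
    |FROM dbname.tablename
    |WHERE year = 2016 AND month = 12 AND day = 28 AND hour = 2 AND venture = 'DEFAULT'""".stripMargin)
rdf.explain()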

But we are seeing parquetDecode errors in our applications frequently after
this. It looks like these decoding errors are caused by changing the serde
from the Spark built-in one to the Hive serde.

I feel that *fixing query plan generation in Spark SQL* is the right approach
instead of forcing users to use the Hive serde.

Is there any workaround/way to fix this issue? I would like to hear more
thoughts on this :)

--
Thanks,
Raju Bairishetti,
www.lazada.com


how the sparksession initialization, set currentDatabase value?

2017-01-10 Thread 李斌松
When Spark reads a Hive table, catalog.currentDatabase is the default database.
How do I set the currentDatabase value when the SparkSession is initialized?





<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>




large number representation

2017-01-10 Thread Zephod
I want to process in Spark large numbers, for example 160 bits. I could store
them as an array of ints or as a java.util.BitSet or something with
compression like https://github.com/lemire/javaewah or
https://github.com/RoaringBitmap/RoaringBitmap. My question is what should I
use so that Spark works the fastest (the operations that I do on those
numbers are computationally light). Should my priority be:
1. that the fewest objects are created and basic types are used the most (is
array of ints good for that?)
2. or that the representation consumes the fewest bytes for example by doing
its own compression
3. or maybe it is possible to combine 1 and 2, by using some kind of
native Spark compression

Thank you in advance for all help,
Regards Zephod
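A rough sketch of option 1 above (my own illustration; the 3 x 64-bit word
layout and all names are assumptions): keeping each 160-bit value as a
fixed-length Array[Long] stays close to primitive types and works with the
built-in encoders:

import org.apache.spark.sql.SparkSession

case class BigVal(words: Array[Long])   // 160 bits -> 3 words, upper bits unused

val spark = SparkSession.builder().appName("bignum-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val ds = spark.createDataset(Seq(
  BigVal(Array(0x1L, 0x2L, 0x3L)),
  BigVal(Array(0x4L, 0x5L, 0x6L))
))

// A computationally light, word-wise operation (bitwise OR with a mask).
val masked = ds.map(v => BigVal(v.words.map(_ | 0xFFL)))
masked.show(false)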






Spark in docker over EC2

2017-01-10 Thread Darren Govoni
Anyone got a good guide for getting the Spark master to talk to remote workers
running inside Docker containers? I followed the tips found by searching but it
still doesn't work. Spark 1.6.2.
I exposed all the ports and tried to set the local IP inside the container to
the host IP, but Spark complains it can't bind the UI ports.
Thanks in advance!


Sent from my Verizon, Samsung Galaxy smartphone
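No guide to offer, but a hedged sketch of the kind of settings I would try
(assumptions on my side, not a confirmed fix): keep SPARK_LOCAL_IP as the
container's own address, advertise the host via SPARK_PUBLIC_DNS, and pin the
driver/UI/block-manager ports so Docker can publish them:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")      // placeholder master URL
  .setAppName("docker-networking-test")
  .set("spark.driver.port", "7001")           // fixed ports so they can be exposed/published
  .set("spark.blockManager.port", "7005")
  .set("spark.ui.port", "4040")

// SPARK_LOCAL_IP should stay the container IP (binding the host IP inside the
// container is what makes the UI bind fail); SPARK_PUBLIC_DNS can carry the
// host address that other nodes use to reach this container.
val sc = new SparkContext(conf)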

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Michael Gummelt
Oh, interesting.  I've never heard of that sort of architecture.  And I'm
not sure exactly how the JNI bindings do the native library discovery, but
I know the MESOS_NATIVE_JAVA_LIBRARY env var has always been the documented
discovery method, so I'd definitely always provide that if I were you.  I'm
not sure why the bindings can't discover based on the standard shared
library mechanisms (LD_LIBRARY_PATH, ld.so.conf).  That's a question for
the Mesos team.

On Tue, Jan 10, 2017 at 12:46 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> nop, there is no "distribution", no spark-submit at the start of my
> process.
> But I found the problem, the behavior when loading mesos native dependency
> changed, and the static initialization block inside 
> org.apache.mesos.MesosSchedulerDriver
> needed the specific reference to libmesos-1.0.0.so.
>
> So just for the record, setting the env variable
> MESOS_NATIVE_JAVA_LIBRARY="//
> libmesos-1.0.0.so" fixed the whole thing.
>
> Thanks for the help !
>
> @michael if you want to talk about the setup we're using, we can talk
> about it directly.
>
>
>
> On Tue, Jan 10, 2017 9:31 PM, Michael Gummelt mgumm...@mesosphere.io
> wrote:
>
>> What do you mean your driver has all the dependencies packaged?  What are
>> "all the dependencies"?  Is the distribution you use to launch your driver
>> built with -Pmesos?
>>
>> On Tue, Jan 10, 2017 at 12:18 PM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>> Hi Michael,
>> I did so, but it's not exactly the problem, you see my driver has all the
>> dependencies packaged, and only the executors fetch via the
>> spark.executor.uri the tgz,
>> The strange thing is that I see in my classpath the
>> org.apache.mesos:mesos-1.0.0-shaded-protobuf dependency packaged in the
>> final dist of my app…
>> So everything should work in theory.
>>
>>
>>
>> On Tue, Jan 10, 2017 7:22 PM, Michael Gummelt mgumm...@mesosphere.io
>> wrote:
>>
>> Just build with -Pmesos http://spark.apache.org/docs/l
>> atest/building-spark.html#building-with-mesos-support
>>
>> On Tue, Jan 10, 2017 at 8:56 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>> I had the same problem, added spark-mesos as dependency and now I get :
>> [2017-01-10 17:45:16,575] {bash_operator.py:77} INFO - Exception in
>> thread "main" java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.mesos.MesosSchedulerDriver
>> [2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
>> org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils
>> $class.createSchedulerDriver(MesosSchedulerUtils.scala:105)
>> [2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
>> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedS
>> chedulerBackend.createSchedulerDriver(MesosCoarseGrainedSche
>> dulerBackend.scala:48)
>> [2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
>> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedS
>> chedulerBackend.start(MesosCoarseGrainedSchedulerBackend.scala:155)
>> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
>> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSched
>> ulerImpl.scala:156)
>> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
>> org.apache.spark.SparkContext.(SparkContext.scala:509)
>> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
>> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
>> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
>> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(S
>> parkSession.scala:868)
>> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
>> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(S
>> parkSession.scala:860)
>> [2017-01-10 17:45:16,578] {bash_operator.py:77} INFO - at
>> scala.Option.getOrElse(Option.scala:121)
>> [2017-01-10 17:45:16,578] {bash_operator.py:77} INFO - at
>> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkS
>> ession.scala:860)
>>
>> Is there any other dependency to add for spark 2.1.0 ?
>>
>>
>>
>> On Tue, Jan 10, 2017 1:26 AM, Abhishek Bhandari abhi10...@gmail.com
>> wrote:
>>
>> Glad that you found it.
>> ᐧ
>>
>> On Mon, Jan 9, 2017 at 3:29 PM, Richard Siebeling 
>> wrote:
>>
>> Probably found it, it turns out that Mesos should be explicitly added
>> while building Spark, I assumed I could use the old build command that I
>> used for building Spark 2.0.0... Didn't see the two lines added in the
>> documentation...
>>
>> Maybe these kind of changes could be added in the changelog under changes
>> of behaviour or changes in the build process or something like that,
>>
>> kind regards,
>> Richard
>>
>>
>> On 9 January 2017 at 22:55, Richard Siebeling 
>> wrote:
>>
>> Hi,
>>
>> I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not
>> parse Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error.
>> Mesos is running fine (both the master as the slave, it's a single machine configuration).

Re: Nested ifs in sparksql

2017-01-10 Thread Olivier Girardot
Are you using the "case when" functions? What do you mean by slow? Can you
share a snippet?
 





On Tue, Jan 10, 2017 8:15 PM, Georg Heiler georg.kf.hei...@gmail.com
wrote:
Maybe you can create an UDF?
Raghavendra Pandey  wrote on Tue., 10 Jan. 2017 at 20:04:
I have of around 41 level of nested if else in spark sql. I have programmed it
using apis on dataframe. But it takes too much time. 
Is there anything I can do to improve on time here? 



Olivier Girardot| Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Olivier Girardot
nop, there is no "distribution", no spark-submit at the start of my process.
But I found the problem: the behavior when loading the Mesos native dependency
changed, and the static initialization block inside
org.apache.mesos.MesosSchedulerDriver needed the specific reference to
libmesos-1.0.0.so.
So just for the record, setting the env variable
MESOS_NATIVE_JAVA_LIBRARY="//libmesos-1.0.0.so" fixed the whole thing.
Thanks for the help!
@michael if you want to talk about the setup we're using, we can talk about it
directly.
 





On Tue, Jan 10, 2017 9:31 PM, Michael Gummelt mgumm...@mesosphere.io
wrote:
What do you mean your driver has all the dependencies packaged?  What are "all
the dependencies"?  Is the distribution you use to launch your driver built with
-Pmesos?

On Tue, Jan 10, 2017 at 12:18 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com>  wrote:
Hi Michael,I did so, but it's not exactly the problem, you see my driver has all
the dependencies packaged, and only the executors fetch via the
spark.executor.uri the tgz,The strange thing is that I see in my classpath the
org.apache.mesos:mesos-1.0.0-shaded-protobuf dependency packaged in the final
dist of my app…So everything should work in theory.

 





On Tue, Jan 10, 2017 7:22 PM, Michael Gummelt mgumm...@mesosphere.io
wrote:
Just build with -Pmesos 
http://spark.apache.org/docs/latest/building-spark.html#
building-with-mesos-support

On Tue, Jan 10, 2017 at 8:56 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com>  wrote:
I had the same problem, added spark-mesos as dependency and now I get :
[2017-01-10 17:45:16,575] {bash_operator.py:77} INFO - Exception in thread
"main" java.lang.NoClassDefFoundError: Could not initialize class
org.apache.mesos.MesosSchedulerDriver[2017-01-10 17:45:16,576]
{bash_operator.py:77} INFO - at org.apache.spark.scheduler.clu
ster.mesos.MesosSchedulerUtils$class.createSchedulerDriver(M
esosSchedulerUtils.scala:105)[2017-01-10 17:45:16,576] {bash_operator.py:77}
INFO - at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedS
chedulerBackend.createSchedulerDriver(MesosCoarseGrainedSchedulerBackend.
scala:48)[2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedS
chedulerBackend.start(MesosCoarseGrainedSchedulerBackend.scala:155)[2017-01-10
17:45:16,577] {bash_operator.py:77} INFO - at org.apache.spark.scheduler.Tas
kSchedulerImpl.start(TaskSchedulerImpl.scala:156)[2017-01-10 17:45:16,577]
{bash_operator.py:77} INFO - at org.apache.spark.SparkContext.
(SparkContext.scala:509)[2017-01-10 17:45:16,577] {bash_operator.py:77}
INFO - at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
[2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(
SparkSession.scala:868)[2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(
SparkSession.scala:860)[2017-01-10 17:45:16,578] {bash_operator.py:77} INFO - at
scala.Option.getOrElse(Option.scala:121)[2017-01-10 17:45:16,578]
{bash_operator.py:77} INFO - at org.apache.spark.sql.SparkSess
ion$Builder.getOrCreate(SparkSession.scala:860)
Is there any other dependency to add for spark 2.1.0 ?

 





On Tue, Jan 10, 2017 1:26 AM, Abhishek Bhandari abhi10...@gmail.com
wrote:
Glad that you found it.ᐧ
On Mon, Jan 9, 2017 at 3:29 PM, Richard Siebeling   wrote:
Probably found it, it turns out that Mesos should be explicitly added while
building Spark, I assumed I could use the old build command that I used for
building Spark 2.0.0... Didn't see the two lines added in the documentation...
Maybe these kind of changes could be added in the changelog under changes of
behaviour or changes in the build process or something like that,
kind regards,Richard

On 9 January 2017 at 22:55, Richard Siebeling   wrote:
Hi,
I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not parse
Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error.Mesos is running fine (both the
master as the slave, it's a single machine configuration).
I really don't understand why this is happening since the same configuration but
using a Spark 2.0.0 is running fine within Vagrant.Could someone please help?
thanks in advance,Richard






-- 
Abhishek J BhandariMobile No. +1 510 493 6205  (USA)
Mobile No. +91 96387 93021  (IND)R & D DepartmentValent Software Inc. CAEmail: 
abhis...@valent-software.com

 

Olivier Girardot| Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
 


-- 
Michael Gummelt
Software Engineer
Mesosphere


Olivier Girardot| Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
 


-- 
Michael Gummelt
Software Engineer
Mesosphere


Olivier Girardot| Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Michael Gummelt
What do you mean your driver has all the dependencies packaged?  What are
"all the dependencies"?  Is the distribution you use to launch your driver
built with -Pmesos?

On Tue, Jan 10, 2017 at 12:18 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi Michael,
> I did so, but it's not exactly the problem, you see my driver has all the
> dependencies packaged, and only the executors fetch via the
> spark.executor.uri the tgz,
> The strange thing is that I see in my classpath the
> org.apache.mesos:mesos-1.0.0-shaded-protobuf dependency packaged in the
> final dist of my app…
> So everything should work in theory.
>
>
>
> On Tue, Jan 10, 2017 7:22 PM, Michael Gummelt mgumm...@mesosphere.io
> wrote:
>
>> Just build with -Pmesos http://spark.apache.org/docs/
>> latest/building-spark.html#building-with-mesos-support
>>
>> On Tue, Jan 10, 2017 at 8:56 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>> I had the same problem, added spark-mesos as dependency and now I get :
>> [2017-01-10 17:45:16,575] {bash_operator.py:77} INFO - Exception in
>> thread "main" java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.mesos.MesosSchedulerDriver
>> [2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
>> org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils
>> $class.createSchedulerDriver(MesosSchedulerUtils.scala:105)
>> [2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
>> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedS
>> chedulerBackend.createSchedulerDriver(MesosCoarseGrainedSchedulerBackend.
>> scala:48)
>> [2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
>> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedS
>> chedulerBackend.start(MesosCoarseGrainedSchedulerBackend.scala:155)
>> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
>> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSched
>> ulerImpl.scala:156)
>> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
>> org.apache.spark.SparkContext.(SparkContext.scala:509)
>> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
>> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
>> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
>> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(
>> SparkSession.scala:868)
>> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
>> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(
>> SparkSession.scala:860)
>> [2017-01-10 17:45:16,578] {bash_operator.py:77} INFO - at
>> scala.Option.getOrElse(Option.scala:121)
>> [2017-01-10 17:45:16,578] {bash_operator.py:77} INFO - at
>> org.apache.spark.sql.SparkSession$Builder.getOrCreate(
>> SparkSession.scala:860)
>>
>> Is there any other dependency to add for spark 2.1.0 ?
>>
>>
>>
>> On Tue, Jan 10, 2017 1:26 AM, Abhishek Bhandari abhi10...@gmail.com
>> wrote:
>>
>> Glad that you found it.
>> ᐧ
>>
>> On Mon, Jan 9, 2017 at 3:29 PM, Richard Siebeling 
>> wrote:
>>
>> Probably found it, it turns out that Mesos should be explicitly added
>> while building Spark, I assumed I could use the old build command that I
>> used for building Spark 2.0.0... Didn't see the two lines added in the
>> documentation...
>>
>> Maybe these kind of changes could be added in the changelog under changes
>> of behaviour or changes in the build process or something like that,
>>
>> kind regards,
>> Richard
>>
>>
>> On 9 January 2017 at 22:55, Richard Siebeling 
>> wrote:
>>
>> Hi,
>>
>> I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not
>> parse Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error.
>> Mesos is running fine (both the master as the slave, it's a single
>> machine configuration).
>>
>> I really don't understand why this is happening since the same
>> configuration but using a Spark 2.0.0 is running fine within Vagrant.
>> Could someone please help?
>>
>> thanks in advance,
>> Richard
>>
>>
>>
>>
>>
>>
>>
>> --
>> *Abhishek J Bhandari*
>> Mobile No. +1 510 493 6205 <(510)%20493-6205> (USA)
>> Mobile No. +91 96387 93021 <+91%2096387%2093021> (IND)
>> *R & D Department*
>> *Valent Software Inc. CA*
>> Email: *abhis...@valent-software.com *
>>
>>
>>
>> *Olivier Girardot* | Associé
>> o.girar...@lateral-thoughts.com
>> +33 6 24 09 17 94
>>
>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>>
>
>
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>



-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Olivier Girardot
Hi Michael,
I did so, but it's not exactly the problem. You see, my driver has all the
dependencies packaged, and only the executors fetch the tgz via
spark.executor.uri.
The strange thing is that I see in my classpath the
org.apache.mesos:mesos-1.0.0-shaded-protobuf dependency packaged in the final
dist of my app… So everything should work in theory.
 





On Tue, Jan 10, 2017 7:22 PM, Michael Gummelt mgumm...@mesosphere.io
wrote:
Just build with -Pmesos 
http://spark.apache.org/docs/latest/building-spark.html#building-with-mesos-support

On Tue, Jan 10, 2017 at 8:56 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com>  wrote:
I had the same problem, added spark-mesos as dependency and now I get :
[2017-01-10 17:45:16,575] {bash_operator.py:77} INFO - Exception in thread
"main" java.lang.NoClassDefFoundError: Could not initialize class
org.apache.mesos.MesosSchedulerDriver[2017-01-10 17:45:16,576]
{bash_operator.py:77} INFO - at org.apache.spark.scheduler.cluster.mesos.
MesosSchedulerUtils$class.createSchedulerDriver(MesosSchedulerUtils.scala:105)
[2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBac
kend.createSchedulerDriver(MesosCoarseGrainedSchedulerBackend.scala:48)
[2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBac
kend.start(MesosCoarseGrainedSchedulerBackend.scala:155)[2017-01-10
17:45:16,577] {bash_operator.py:77} INFO - at org.apache.spark.scheduler.
TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)[2017-01-10 17:45:16,577]
{bash_operator.py:77} INFO - at org.apache.spark.SparkContext.
(SparkContext.scala:509)[2017-01-10 17:45:16,577] {bash_operator.py:77}
INFO - at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
[2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at org.apache.spark.sql.
SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)[2017-01-10
17:45:16,577] {bash_operator.py:77} INFO - at org.apache.spark.sql.
SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)[2017-01-10
17:45:16,578] {bash_operator.py:77} INFO - at scala.Option.getOrElse(Option.
scala:121)[2017-01-10 17:45:16,578] {bash_operator.py:77} INFO - at
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
Is there any other dependency to add for spark 2.1.0 ?

 





On Tue, Jan 10, 2017 1:26 AM, Abhishek Bhandari abhi10...@gmail.com
wrote:
Glad that you found it.ᐧ
On Mon, Jan 9, 2017 at 3:29 PM, Richard Siebeling   wrote:
Probably found it, it turns out that Mesos should be explicitly added while
building Spark, I assumed I could use the old build command that I used for
building Spark 2.0.0... Didn't see the two lines added in the documentation...
Maybe these kind of changes could be added in the changelog under changes of
behaviour or changes in the build process or something like that,
kind regards,Richard

On 9 January 2017 at 22:55, Richard Siebeling   wrote:
Hi,
I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not parse
Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error.Mesos is running fine (both the
master as the slave, it's a single machine configuration).
I really don't understand why this is happening since the same configuration but
using a Spark 2.0.0 is running fine within Vagrant.Could someone please help?
thanks in advance,Richard






-- 
Abhishek J BhandariMobile No. +1 510 493 6205  (USA)
Mobile No. +91 96387 93021  (IND)R & D DepartmentValent Software Inc. CAEmail: 
abhis...@valent-software.com

 

Olivier Girardot| Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
 


-- 
Michael Gummelt
Software Engineer
Mesosphere


Olivier Girardot| Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94

Re: Dataset Type safety

2017-01-10 Thread Michael Armbrust
>
> As I've specified *.as[Person]* which does schema inferance then
> *"option("inferSchema","true")" *is redundant and not needed!


The resolution of fields is done by name, not by position, for case
classes.  This is what allows us to support more complex things like JSON
or nested structures.  If you just want to map it by position you can
do .as[(String, Long)] to map it to a tuple instead.

And lastly does .as[Person] check that column value matches with data type
> i.e. "age Long" would fail if it gets a non numeric value! because the
> input file could be millions of row which could be very time consuming.


No, this is a static check based on the schema.  It does not scan the data
(though schema inference does).
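To make the by-name vs by-position point concrete, a small sketch (mine, with a
placeholder file path) of two ways to read a headerless CSV into
Dataset[Person] without relying on inferSchema:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("dataset-type-safety").master("local[*]").getOrCreate()
import spark.implicits._

// Option 1: map by position to a tuple of strings, then build the case class.
val byPosition = spark.read.option("header", "false").csv("/people_noheader.csv")
  .as[(String, String)]
  .map { case (n, a) => Person(n, a.toLong) }

// Option 2: name the columns up front, cast, then resolve by name with .as[Person].
val byName = spark.read.option("header", "false").csv("/people_noheader.csv")
  .toDF("name", "age")
  .select($"name", $"age".cast("long"))
  .as[Person]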

On Tue, Jan 10, 2017 at 11:34 AM, A Shaikh  wrote:

> I have a simple people.csv and following SimpleApp
>
>
> people.csv
> --
> name,age
> abc,22
> xyz,32
>
> 
> Working Code
> 
> Object SimpleApp {}
>   case class Person(name: String, age: Long)
>   def main(args: Array[String]): Unit = {
> val spark = SparkFactory.getSparkSession("PIPE2Dataset")
> import spark.implicits._
>
> val peopleDS = spark.read.option("inferSchema","true").option("header",
> "true").option("delimiter", ",").csv("/people.csv").as[Person]
> }
> 
>
>
> 
> Fails for data with no header
> 
> Removing header record "name,age" AND switching header option off
> =>.option("header", "false") return error => *cannot resolve '`name`'
> given input columns: [_c0, _c1]*
> val peopleDS = spark.read.option("inferSchema","true").option("header",
> "false").option("delimiter", ",").csv("/people.csv").as[Person]
>
> Should'nt this just assing the header from Person class
>
>
> 
> invalid data
> 
> As I've specified *.as[Person]* which does schema inferance then 
> *"option("inferSchema","true")"
> *is redundant and not needed!
>
>
> And lastly does .as[Person] check that column value matches with data type
> i.e. "age Long" would fail if it gets a non numeric value! because the
> input file could be millions of row which could be very time consuming.
>


Library dependencies in Spark

2017-01-10 Thread Keith Turner
I recently wrote a blog post[1] sharing my experiences with using
Apache Spark to load data into Apache Fluo. One of the things I cover
in this blog post is late binding of dependencies and exclusion of
provided dependencies when building a shaded jar.  When writing the
post, I was unsure about dependency isolation and convergence
expectations in the Spark env.

Does Spark support any form of dependency isolation for user code?
For example can the Spark framework use Guava ver X while user code
uses Guava version Y?  This is assuming the user packaged Guava
version Y in their shaded jar.  Or, are Spark users expected to
converge their user dependency versions with those used by Spark?  For
example, the user is expected to converge their code to use Guava
version X which is used by the Spark framework.

[1]: http://fluo.apache.org/blog/2016/12/22/spark-load/
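As far as I know Spark does not give user code an isolated classloader by
default, so relocation/shading on the user side is the usual approach. A hedged
sketch of what that looks like with sbt-assembly (an assumption on my part; if
your build is Maven, the maven-shade-plugin's relocation does the same job),
with an arbitrary shaded package name:

// build.sbt (requires the sbt-assembly plugin)
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "myshaded.guava.@1").inAll
)

// Alternatively, spark.driver.userClassPathFirst / spark.executor.userClassPathFirst
// can be set to "true" to prefer the user jar, but both are marked experimental.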




Dataset Type safety

2017-01-10 Thread A Shaikh
I have a simple people.csv and following SimpleApp


people.csv
--
name,age
abc,22
xyz,32


Working Code

object SimpleApp {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkFactory.getSparkSession("PIPE2Dataset")  // user's own session factory
    import spark.implicits._

    val peopleDS = spark.read.option("inferSchema", "true").option("header", "true")
      .option("delimiter", ",").csv("/people.csv").as[Person]
  }
}




Fails for data with no header

Removing the header record "name,age" AND switching the header option off
(.option("header", "false")) returns the error: *cannot resolve '`name`' given
input columns: [_c0, _c1]*
val peopleDS = spark.read.option("inferSchema","true").option("header",
"false").option("delimiter", ",").csv("/people.csv").as[Person]

Shouldn't this just assign the header from the Person class?



invalid data

As I've specified *.as[Person]*, which does schema inference, then
*"option("inferSchema","true")"* is redundant and not needed!


And lastly, does .as[Person] check that the column value matches the data type,
i.e. "age Long" would fail if it gets a non-numeric value? Because the input
file could be millions of rows, that could be very time consuming.


SparkAppHandle.Listener: infoChanged

2017-01-10 Thread Benson Qiu
The JavaDocs

say that `infoChanged` is a "callback for changes in any information that
is not the handle's state."

*Can I get more information on what counts as a change in information? How
often does this callback occur?*

I am running into an issue on YARN where the Mapper container is failing
due to a timeout. I want to pass a Hadoop Context

and
periodically update the status to let the Mapper know that the Spark job is
making progress.

I've included some pseudocode below for what I would like to do. This code
would only work if `infoChanged` is called frequently (i.e. if my Mapper
timeout is 10 minutes, `infoChanged` should be called more frequently than
once every 10 minutes).

public static class SparkAppHandleListener implements SparkAppHandle.Listener {
    private final Context context;

    public SparkAppHandleListener(Context context) {
        this.context = context;
    }

    @Override
    public void stateChanged(SparkAppHandle appHandle) {
        this.context.setStatus("some update here");
    }

    @Override
    public void infoChanged(SparkAppHandle appHandle) {
        this.context.setStatus("some update here");
    }
}

SparkLauncher sparkLauncher = ...
Context context = ...
sparkLauncher.startApplication(new SparkAppHandleListener(context));

Thanks,
Benson


Re: Nested ifs in sparksql

2017-01-10 Thread Georg Heiler
Maybe you can create an UDF?
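A minimal sketch of that idea (my own illustration; column name and branches
are placeholders): move the deep if/else chain into one Scala function and
register it as a UDF, instead of stacking ~41 when/otherwise calls on the
DataFrame:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("nested-if-udf").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(5, 50, 500, 5000).toDF("value")   // stand-in data

// The whole branch chain is plain Scala, evaluated once per row.
val bucketize = udf((x: Int) =>
  if (x < 10) "low"
  else if (x < 100) "mid"
  else if (x < 1000) "high"
  else "huge"                                   // ...extend to the full 41 branches
)

df.withColumn("bucket", bucketize($"value")).show()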

Raghavendra Pandey  wrote on Tue., 10 Jan. 2017 at 20:04:

> I have of around 41 level of nested if else in spark sql. I have
> programmed it using apis on dataframe. But it takes too much time.
> Is there anything I can do to improve on time here?
>


Nested ifs in sparksql

2017-01-10 Thread Raghavendra Pandey
I have around 41 levels of nested if/else in Spark SQL. I have programmed it
using the DataFrame APIs, but it takes too much time.
Is there anything I can do to improve the time here?


Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Michael Gummelt
Just build with -Pmesos
http://spark.apache.org/docs/latest/building-spark.html#building-with-mesos-support

On Tue, Jan 10, 2017 at 8:56 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> I had the same problem, added spark-mesos as dependency and now I get :
> [2017-01-10 17:45:16,575] {bash_operator.py:77} INFO - Exception in thread
> "main" java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.mesos.MesosSchedulerDriver
> [2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
> org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils$class.
> createSchedulerDriver(MesosSchedulerUtils.scala:105)
> [2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBac
> kend.createSchedulerDriver(MesosCoarseGrainedSchedulerBackend.scala:48)
> [2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at
> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBac
> kend.start(MesosCoarseGrainedSchedulerBackend.scala:155)
> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(
> TaskSchedulerImpl.scala:156)
> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
> org.apache.spark.SparkContext.(SparkContext.scala:509)
> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
> org.apache.spark.sql.SparkSession$Builder$$anonfun$
> 6.apply(SparkSession.scala:868)
> [2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at
> org.apache.spark.sql.SparkSession$Builder$$anonfun$
> 6.apply(SparkSession.scala:860)
> [2017-01-10 17:45:16,578] {bash_operator.py:77} INFO - at
> scala.Option.getOrElse(Option.scala:121)
> [2017-01-10 17:45:16,578] {bash_operator.py:77} INFO - at
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.
> scala:860)
>
> Is there any other dependency to add for spark 2.1.0 ?
>
>
>
> On Tue, Jan 10, 2017 1:26 AM, Abhishek Bhandari abhi10...@gmail.com wrote:
>
>> Glad that you found it.
>> ᐧ
>>
>> On Mon, Jan 9, 2017 at 3:29 PM, Richard Siebeling 
>> wrote:
>>
>> Probably found it, it turns out that Mesos should be explicitly added
>> while building Spark, I assumed I could use the old build command that I
>> used for building Spark 2.0.0... Didn't see the two lines added in the
>> documentation...
>>
>> Maybe these kind of changes could be added in the changelog under changes
>> of behaviour or changes in the build process or something like that,
>>
>> kind regards,
>> Richard
>>
>>
>> On 9 January 2017 at 22:55, Richard Siebeling 
>> wrote:
>>
>> Hi,
>>
>> I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not
>> parse Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error.
>> Mesos is running fine (both the master as the slave, it's a single
>> machine configuration).
>>
>> I really don't understand why this is happening since the same
>> configuration but using a Spark 2.0.0 is running fine within Vagrant.
>> Could someone please help?
>>
>> thanks in advance,
>> Richard
>>
>>
>>
>>
>>
>>
>>
>> --
>> *Abhishek J Bhandari*
>> Mobile No. +1 510 493 6205 <(510)%20493-6205> (USA)
>> Mobile No. +91 96387 93021 <+91%2096387%2093021> (IND)
>> *R & D Department*
>> *Valent Software Inc. CA*
>> Email: *abhis...@valent-software.com *
>>
>
>
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>



-- 
Michael Gummelt
Software Engineer
Mesosphere


Shortest path performance in Graphx with Spark

2017-01-10 Thread Gerard Casey
Hello everyone,

I am creating a graph from a `gz` compressed `json` file of `edge` and 
`vertices` type.

I have put the files in a dropbox folder [here][1]

I load and map these `json` records to create the `vertices` and `edge` types 
required by `graphx` like this:

val vertices_raw = sqlContext.read.json("path/vertices.json.gz")
val vertices = vertices_raw.rdd.map(row=> 
((row.getAs[String]("toid").stripPrefix("osgb").toLong),row.getAs[Long]("index")))
val verticesRDD: RDD[(VertexId, Long)] = vertices
val edges_raw = sqlContext.read.json("path/edges.json.gz")
val edgesRDD = edges_raw.rdd.map(row => (Edge(
  row.getAs[String]("positiveNode").stripPrefix("osgb").toLong,
  row.getAs[String]("negativeNode").stripPrefix("osgb").toLong,
  row.getAs[Double]("length"))))
val my_graph: Graph[(Long),Double] = Graph.apply(verticesRDD, 
edgesRDD).partitionBy(PartitionStrategy.RandomVertexCut)

I then use this `dijkstra` implementation I found to compute a shortest path 
between two vertices:

def dijkstra[VD](g: Graph[VD, Double], origin: VertexId) = {
  var g2 = g.mapVertices(
(vid, vd) => (false, if (vid == origin) 0 else Double.MaxValue, 
List[VertexId]())
  )
  for (i <- 1L to g.vertices.count - 1) {
    val currentVertexId: VertexId = g2.vertices.filter(!_._2._1)
      .fold((0L, (false, Double.MaxValue, List[VertexId]())))(
        (a, b) => if (a._2._2 < b._2._2) a else b)
      ._1

val newDistances: VertexRDD[(Double, List[VertexId])] =
  g2.aggregateMessages[(Double, List[VertexId])](
ctx => if (ctx.srcId == currentVertexId) {
  ctx.sendToDst((ctx.srcAttr._2 + ctx.attr, ctx.srcAttr._3 :+ 
ctx.srcId))
},
(a, b) => if (a._1 < b._1) a else b
  )
g2 = g2.outerJoinVertices(newDistances)((vid, vd, newSum) => {
  val newSumVal = newSum.getOrElse((Double.MaxValue, 
List[VertexId]()))
  (
vd._1 || vid == currentVertexId,
math.min(vd._2, newSumVal._1),
if (vd._2 < newSumVal._1) vd._3 else newSumVal._2
)
})
}

  g.outerJoinVertices(g2.vertices)((vid, vd, dist) =>
(vd, dist.getOrElse((false, Double.MaxValue, List[VertexId]()))
  .productIterator.toList.tail
  ))
}

I take two random vertex id's:

val v1 = 400028222916L
val v2 = 400031019012L

and compute the path between them:

val results = dijkstra(my_graph, v1).vertices.map(_._2).collect

I am unable to compute this locally on my laptop without getting a 
stackoverflow error. I have 8GB RAM and 2.6 GHz Intel Core i5 processor. I can 
see that it is using 3 out of 4 available cores. I can load this graph and
compute shortest paths with the `igraph` library in Python on exactly the same
graph at an average of around 10 paths per second. Is this an inefficient means
of computing paths? At scale, on multiple nodes, the paths do compute (no
stackoverflow error), but it is still 30-40 seconds per path computation. I
must be missing something.

Thanks 

  [1]: https://www.dropbox.com/sh/9ug5ikr6j357q7j/AACDBR9UdM0g_ck_ykB8KXPXa?dl=0

Re: backward compatibility

2017-01-10 Thread Marco Mistroni
I think the old APIs are still supported, but you are advised to migrate.
I migrated a few apps from 1.6 to 2.0 with minimal changes.
Hth

On 10 Jan 2017 4:14 pm, "pradeepbill"  wrote:

> hi there, I am using spark 1.4 code and now we plan to move to spark 2.0,
> and
> when I check the documentation below, there are only a few features
> backward
> compatible, does that mean I have change most of my code , please advice.
>
> One of the largest changes in Spark 2.0 is the new updated APIs:
>
> Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset
> have been unified, i.e. DataFrame is just a type alias for Dataset of Row.
> In Python and R, given the lack of type safety, DataFrame is the main
> programming interface.
> *SparkSession: new entry point that replaces the old SQLContext and
> HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are
> kept for backward compatibility.*
> A new, streamlined configuration API for SparkSession
> Simpler, more performant accumulator API
> A new, improved Aggregator API for typed aggregation in Datasets
>
>
> thanks
> Pradeep
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/backward-compatibility-tp28296.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Olivier Girardot
I had the same problem, added spark-mesos as dependency and now I get:
[2017-01-10 17:45:16,575] {bash_operator.py:77} INFO - Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.mesos.MesosSchedulerDriver
[2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils$class.createSchedulerDriver(MesosSchedulerUtils.scala:105)
[2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.createSchedulerDriver(MesosCoarseGrainedSchedulerBackend.scala:48)
[2017-01-10 17:45:16,576] {bash_operator.py:77} INFO - at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.start(MesosCoarseGrainedSchedulerBackend.scala:155)
[2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
[2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
[2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
[2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
[2017-01-10 17:45:16,577] {bash_operator.py:77} INFO - at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
[2017-01-10 17:45:16,578] {bash_operator.py:77} INFO - at scala.Option.getOrElse(Option.scala:121)
[2017-01-10 17:45:16,578] {bash_operator.py:77} INFO - at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)

Is there any other dependency to add for Spark 2.1.0?
 





On Tue, Jan 10, 2017 1:26 AM, Abhishek Bhandari abhi10...@gmail.com
wrote:
Glad that you found it.ᐧ
On Mon, Jan 9, 2017 at 3:29 PM, Richard Siebeling   wrote:
Probably found it, it turns out that Mesos should be explicitly added while
building Spark, I assumed I could use the old build command that I used for
building Spark 2.0.0... Didn't see the two lines added in the documentation...
Maybe these kind of changes could be added in the changelog under changes of
behaviour or changes in the build process or something like that,
kind regards,Richard

On 9 January 2017 at 22:55, Richard Siebeling   wrote:
Hi,
I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not parse
Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error.Mesos is running fine (both the
master as the slave, it's a single machine configuration).
I really don't understand why this is happening since the same configuration but
using a Spark 2.0.0 is running fine within Vagrant.Could someone please help?
thanks in advance,Richard






-- 
Abhishek J BhandariMobile No. +1 510 493 6205 (USA)
Mobile No. +91 96387 93021 (IND)R & D DepartmentValent Software Inc. CAEmail: 
abhis...@valent-software.com

 

Olivier Girardot| Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94

backward compatibility

2017-01-10 Thread pradeepbill
hi there, I am using Spark 1.4 code and now we plan to move to Spark 2.0. When
I check the documentation below, only a few features are backward compatible.
Does that mean I have to change most of my code? Please advise.

One of the largest changes in Spark 2.0 is the new updated APIs:

Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset
have been unified, i.e. DataFrame is just a type alias for Dataset of Row.
In Python and R, given the lack of type safety, DataFrame is the main
programming interface.
*SparkSession: new entry point that replaces the old SQLContext and
HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are
kept for backward compatibility.*
A new, streamlined configuration API for SparkSession
Simpler, more performant accumulator API
A new, improved Aggregator API for typed aggregation in Datasets


thanks
Pradeep
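For the SparkSession point above, a hedged before/after sketch (illustrative
only; file and view names are placeholders):

// Spark 1.x style:
//   val sc = new SparkContext(new SparkConf().setAppName("app"))
//   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//   val df = sqlContext.read.json("events.json")

// Spark 2.x style:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("app")
  .enableHiveSupport()   // takes the place of HiveContext when Hive access is needed
  .getOrCreate()

val df = spark.read.json("events.json")
df.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()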






Re: Kryo On Spark 1.6.0

2017-01-10 Thread Yang Cao
If you don't mind, could you please share the Scala solution with me? I tried
to use Kryo but it did not seem to work at all. I hope to get a practical
example. Thanks.
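A minimal Scala sketch of the registration approach, as far as I understand it
(my own example, untested against your job; MyRecord is a stand-in for your
classes):

import org.apache.spark.SparkConf

case class MyRecord(id: Long, name: String)   // placeholder for your own classes

val conf = new SparkConf()
  .setAppName("kryo-registration")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(
    classOf[MyRecord],
    classOf[Array[MyRecord]],                                  // arrays often need registering too
    classOf[scala.collection.mutable.WrappedArray.ofRef[_]]    // the class from the error message
  ))

// From Java the same class object can be obtained with
// Class.forName("scala.collection.mutable.WrappedArray$ofRef").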
> On 2017年1月10日, at 19:10, Enrico DUrso  wrote:
> 
> Hi,
> 
> I am trying to use Kryo on Spark 1.6.0.
> I am able to register my own classes and it works, but when I set 
> “spark.kryo.registrationRequired “ to true, I get an error about a scala 
> class:
> “Class is not registered: scala.collection.mutable.WrappedArray$ofRef”.
> 
> Any of you has already solved this issue in Java? I found the code to solve 
> it in Scala, but unable to register this class in Java.
> 
> Cheers,
> 
> enrico
> 
> 
> CONFIDENTIALITY WARNING.
> This message and the information contained in or attached to it are private 
> and confidential and intended exclusively for the addressee. everis informs 
> to whom it may receive it in error that it contains privileged information 
> and its use, copy, reproduction or distribution is prohibited. If you are not 
> an intended recipient of this E-mail, please notify the sender, delete it and 
> do not read, act upon, print, disclose, copy, retain or redistribute any 
> portion of this E-mail.



Re: Spark Read from Google store and save in AWS s3

2017-01-10 Thread A Shaikh
This should help
https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example

On 8 January 2017 at 03:49, neil90  wrote:

> Here is how you would read from Google Cloud Storage(note you need to
> create
> a service account key) ->
>
> os.environ['PYSPARK_SUBMIT_ARGS'] = """--jars
> /home/neil/Downloads/gcs-connector-latest-hadoop2.jar pyspark-shell"""
>
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession, SQLContext
>
> conf = SparkConf()\
> .setMaster("local[8]")\
> .setAppName("GS")
>
> sc = SparkContext(conf=conf)
>
> sc._jsc.hadoopConfiguration().set("fs.gs.impl",
> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
> sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl",
> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
> sc._jsc.hadoopConfiguration().set("fs.gs.project.id", "PUT UR GOOGLE
> PROJECT
> ID HERE")
>
> sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.email",
> "testa...@sparkgcs.iam.gserviceaccount.com")
> sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable",
> "true")
> sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.keyfile",
> "sparkgcs-96bd21691c29.p12")
>
> spark = SparkSession.builder\
> .config(conf=sc.getConf())\
> .getOrCreate()
>
> dfTermRaw = spark.read.format("csv")\
> .option("header", "true")\
> .option("delimiter" ,"\t")\
> .option("inferSchema", "true")\
> .load("gs://bucket_test/sample.tsv")
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-Read-from-Google-store-and-
> save-in-AWS-s3-tp28278p28286.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


RE: Kryo On Spark 1.6.0

2017-01-10 Thread Enrico DUrso
Hi,

I agree with you Richard.
The point is that it looks like some classes which are used internally by Spark
are not registered (for instance, the one I mentioned in the previous email is
something I am not directly using).
For those classes the serialization performance will be poor, according to how
Spark works.
How can I register all those classes?

cheers,

From: Richard Startin [mailto:richardstar...@outlook.com]
Sent: 10 January 2017 11:18
To: Enrico DUrso; user@spark.apache.org
Subject: Re: Kryo On Spark 1.6.0


Hi Enrico,



Only set spark.kryo.registrationRequired if you want to forbid any classes you 
have not explicitly registered - see 
http://spark.apache.org/docs/latest/configuration.html.

To enable kryo, you just need 
spark.serializer=org.apache.spark.serializer.KryoSerializer. There is some info 
here - http://spark.apache.org/docs/latest/tuning.html

Cheers,
Richard



https://richardstartin.com/


From: Enrico DUrso >
Sent: 10 January 2017 11:10
To: user@spark.apache.org
Subject: Kryo On Spark 1.6.0


Hi,

I am trying to use Kryo on Spark 1.6.0.
I am able to register my own classes and it works, but when I set 
"spark.kryo.registrationRequired " to true, I get an error about a scala class:
"Class is not registered: scala.collection.mutable.WrappedArray$ofRef".

Any of you has already solved this issue in Java? I found the code to solve it 
in Scala, but unable to register this class in Java.

Cheers,

enrico



CONFIDENTIALITY WARNING.
This message and the information contained in or attached to it are private and 
confidential and intended exclusively for the addressee. everis informs to whom 
it may receive it in error that it contains privileged information and its use, 
copy, reproduction or distribution is prohibited. If you are not an intended 
recipient of this E-mail, please notify the sender, delete it and do not read, 
act upon, print, disclose, copy, retain or redistribute any portion of this 
E-mail.



CONFIDENTIALITY WARNING.
This message and the information contained in or attached to it are private and 
confidential and intended exclusively for the addressee. everis informs to whom 
it may receive it in error that it contains privileged information and its use, 
copy, reproduction or distribution is prohibited. If you are not an intended 
recipient of this E-mail, please notify the sender, delete it and do not read, 
act upon, print, disclose, copy, retain or redistribute any portion of this 
E-mail.


Re: Kryo On Spark 1.6.0

2017-01-10 Thread Richard Startin
Hi Enrico,


Only set spark.kryo.registrationRequired if you want to forbid any classes you 
have not explicitly registered - see 
http://spark.apache.org/docs/latest/configuration.html.


To enable kryo, you just need 
spark.serializer=org.apache.spark.serializer.KryoSerializer. There is some info 
here - http://spark.apache.org/docs/latest/tuning.html

Cheers,
Richard



https://richardstartin.com/



From: Enrico DUrso 
Sent: 10 January 2017 11:10
To: user@spark.apache.org
Subject: Kryo On Spark 1.6.0


Hi,

I am trying to use Kryo on Spark 1.6.0.
I am able to register my own classes and it works, but when I set 
"spark.kryo.registrationRequired " to true, I get an error about a scala class:
"Class is not registered: scala.collection.mutable.WrappedArray$ofRef".

Any of you has already solved this issue in Java? I found the code to solve it 
in Scala, but unable to register this class in Java.

Cheers,

enrico



CONFIDENTIALITY WARNING.
This message and the information contained in or attached to it are private and 
confidential and intended exclusively for the addressee. everis informs to whom 
it may receive it in error that it contains privileged information and its use, 
copy, reproduction or distribution is prohibited. If you are not an intended 
recipient of this E-mail, please notify the sender, delete it and do not read, 
act upon, print, disclose, copy, retain or redistribute any portion of this 
E-mail.


Re: Spark ML's RandomForestClassifier OOM

2017-01-10 Thread Julio Antonio Soto de Vicente
No. I am running Spark on YARN on a 3 node testing cluster. 

My guess is that given the amount of splits done by a hundred trees of depth 30 
(which should be more than 100 * 2^30), either the executors or the driver die 
OOM while trying to store all the split metadata. I guess that  the same issue 
affects both local and distributed modes. But those are just conjectures.

--
Julio

> El 10 ene 2017, a las 11:22, Marco Mistroni  escribió:
> 
> You running locally? Found exactly same issue.
> 2 solutions:
> _ reduce datA size.  
> _ run on EMR
> Hth
> 
>> On 10 Jan 2017 10:07 am, "Julio Antonio Soto"  wrote:
>> Hi, 
>> 
>> I am running into OOM problems while training a Spark ML 
>> RandomForestClassifier (maxDepth of 30, 32 maxBins, 100 trees).
>> 
>> My dataset is arguably pretty big given the executor count and size (8x5G), 
>> with approximately 20M rows and 130 features.
>> 
>> The "fun fact" is that a single DecisionTreeClassifier with the same specs 
>> (same maxDepth and maxBins) is able to train without problems in a couple of 
>> minutes.
>> 
>> AFAIK the current random forest implementation grows each tree sequentially, 
>> which means that DecisionTreeClassifiers are fit one by one, and therefore 
>> the training process should be similar in terms of memory consumption. Am I 
>> missing something here?
>> 
>> Thanks
>> Julio


Kryo On Spark 1.6.0

2017-01-10 Thread Enrico DUrso
Hi,

I am trying to use Kryo on Spark 1.6.0.
I am able to register my own classes and it works, but when I set 
"spark.kryo.registrationRequired " to true, I get an error about a scala class:
"Class is not registered: scala.collection.mutable.WrappedArray$ofRef".

Any of you has already solved this issue in Java? I found the code to solve it 
in Scala, but unable to register this class in Java.

Cheers,

enrico



CONFIDENTIALITY WARNING.
This message and the information contained in or attached to it are private and 
confidential and intended exclusively for the addressee. everis informs to whom 
it may receive it in error that it contains privileged information and its use, 
copy, reproduction or distribution is prohibited. If you are not an intended 
recipient of this E-mail, please notify the sender, delete it and do not read, 
act upon, print, disclose, copy, retain or redistribute any portion of this 
E-mail.


Re: spark-shell running out of memory even with 6GB ?

2017-01-10 Thread Sean Owen
Maybe ... here are a bunch of things I'd check:

Are you running out of memory, or just see a lot of mem usage? JVMs will
happily use all the memory you allow them even if some of it could be
reclaimed.

Did the driver run out of mem? did you give 6G to the driver or executor?

OOM errors do show where they occur of course, although often it tells you
exactly where it happened but not exactly why it happened. That's true of
any JVM.

Memory config is hard, yeah: you have to think about the resource manager's
config (e.g. YARN), the JVM's, and then Spark's. It's gotten simpler over
time but defaults invariably need tuning. Dynamic allocation can help to
some extent.

I don't know if a repartition() would run you out of memory.
The 2GB issue is mostly an artifact of byte[] arrays having a max length of
2^31-1. Fixing that is pretty hard, and yeah for now the usual advice is
"don't do that" -- find ways to avoid huge allocations because it's
probably a symptom of performance bottlenecks anyway.

On Tue, Jan 10, 2017 at 2:21 AM Kevin Burton  wrote:

> Ah.. ok. I think I know what's happening now. I think we found this
> problem when running a job and doing a repartition()
>
> Spark is just way way way too sensitive to memory configuration.
>
> The 2GB per shuffle limit is also insanely silly in 2017.
>
> So I think what we did is did a repartition too large and now we ran out
> of memory in spark shell.
>
> On Mon, Jan 9, 2017 at 5:53 PM, Steven Ruppert 
> wrote:
>
> The spark-shell process alone shouldn't take up that much memory, at least
> in my experience. Have you dumped the heap to see what's all in there? What
> environment are you running spark in?
>
> Doing stuff like RDD.collect() or .countByKey will pull potentially a lot
> of data the spark-shell heap. Another thing thing that can fill up the
> spark master process heap (which is also run in the spark-shell process) is
> running lots of jobs, the logged SparkEvents of which stick around in order
> for the UI to render. There are some options under `spark.ui.retained*` to
> limit that if it's a problem.
>
>
> On Mon, Jan 9, 2017 at 6:00 PM, Kevin Burton  wrote:
>
> We've had various OOM issues with spark and have been trying to track them
> down one by one.
>
> Now we have one in spark-shell which is super surprising.
>
> We currently allocate 6GB to spark shell, as confirmed via 'ps'
>
> Why the heck would the *shell* need that much memory.
>
> I'm going to try to give it more of course but would be nice to know if
> this is a legitimate memory constraint or there is a bug somewhere.
>
> PS: One thought I had was that it would be nice to have spark keep track
> of where an OOM was encountered, in what component.
>
> Kevin
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>
>
> *CONFIDENTIALITY NOTICE: This email message, and any documents, files or
> previous e-mail messages attached to it is for the sole use of the intended
> recipient(s) and may contain confidential and privileged information. Any
> unauthorized review, use, disclosure or distribution is prohibited. If you
> are not the intended recipient, please contact the sender by reply email
> and destroy all copies of the original message.*
>
>
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


Re: Spark ML's RandomForestClassifier OOM

2017-01-10 Thread Marco Mistroni
You running locally? Found exactly same issue.
2 solutions:
_ reduce datA size.
_ run on EMR
Hth

On 10 Jan 2017 10:07 am, "Julio Antonio Soto"  wrote:

> Hi,
>
> I am running into OOM problems while training a Spark ML
> RandomForestClassifier (maxDepth of 30, 32 maxBins, 100 trees).
>
> My dataset is arguably pretty big given the executor count and size
> (8x5G), with approximately 20M rows and 130 features.
>
> The "fun fact" is that a single DecisionTreeClassifier with the same specs
> (same maxDepth and maxBins) is able to train without problems in a couple
> of minutes.
>
> AFAIK the current random forest implementation grows each tree
> sequentially, which means that DecisionTreeClassifiers are fit one by one,
> and therefore the training process should be similar in terms of memory
> consumption. Am I missing something here?
>
> Thanks
> Julio
>


Spark ML's RandomForestClassifier OOM

2017-01-10 Thread Julio Antonio Soto
Hi,

I am running into OOM problems while training a Spark ML
RandomForestClassifier (maxDepth of 30, 32 maxBins, 100 trees).

My dataset is arguably pretty big given the executor count and size (8x5G),
with approximately 20M rows and 130 features.

The "fun fact" is that a single DecisionTreeClassifier with the same specs
(same maxDepth and maxBins) is able to train without problems in a couple
of minutes.

AFAIK the current random forest implementation grows each tree
sequentially, which means that DecisionTreeClassifiers are fit one by one,
and therefore the training process should be similar in terms of memory
consumption. Am I missing something here?

Thanks
Julio
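For reference, a small sketch of the configuration being discussed plus the
per-node aggregation budget that usually matters for memory (maxMemoryInMB);
the two-row dataset is only a stand-in for the real 20M x 130 input:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rf-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val trainingDF = Seq(
  (0.0, Vectors.dense(0.1, 0.2)),
  (1.0, Vectors.dense(0.9, 0.8))
).toDF("label", "features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)
  .setMaxDepth(30)            // depth 30 means a very large number of candidate nodes
  .setMaxBins(32)
  .setMaxMemoryInMB(512)      // default is 256; controls how many nodes are split per pass

val model = rf.fit(trainingDF)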


Spark JDBC Data type mapping (Float and smallInt) Issue

2017-01-10 Thread santlal56
Hi,
I am new to Spark Scala development. I have created a job to read data from a
MySQL table using the existing data source API (Spark JDBC). I have questions
regarding data type mapping, as follows.
*Scenario 1:*
I have created a table with FLOAT type in MySQL, but while reading through
Spark JDBC I am getting DoubleType.

*Scenario 2:*
I have created a table with SMALLINT type in MySQL, but while reading through
Spark JDBC I am getting IntegerType.

This mapping is done in the
*org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD* object.
Here my questions are:
*1. Why is FLOAT mapped to DoubleType and SMALLINT mapped to IntegerType?
2. Why can we not handle FLOAT and SMALLINT in MySQLDialect, as BINARY is
already handled?*

I am using below version of jar : 
mysql-connector-java-5.1.34.jar
spark-core_2.11 version: '2.0.2'
spark-sql_2.11  version: '2.0.2'

Thanks
Santlal J. Gupta
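Not an answer to the "why", but a hedged sketch (my own, based on the public
JdbcDialect developer API) of overriding the mapping yourself by registering a
custom dialect; the type choices below are assumptions:

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

object MyMySQLDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  // Map FLOAT/REAL to FloatType and SMALLINT to ShortType instead of the defaults.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case Types.REAL | Types.FLOAT => Some(FloatType)
      case Types.SMALLINT           => Some(ShortType)
      case _                        => None   // fall back to the built-in mapping
    }
}

JdbcDialects.registerDialect(MyMySQLDialect)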


