Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Mike Metzger
Hi Kevin -

   There's not really a race condition: the 64-bit value is split into a
31-bit partition ID (the upper portion) and a 33-bit incrementing ID.  In
other words, as long as each partition contains fewer than 2^33 (roughly 8.6
billion) entries there should be no overlap, and no communication between
executors is needed to get the next ID.

Depending on what you mean by duplication, there shouldn't be any within a
column as long as you maintain some sort of state (i.e., the startval Mich
shows, a previous max ID, etc.).  While these IDs are unique in that sense,
they are not the same as a UUID / GUID, which is generally unique across
all entries assuming enough randomness.  Think of the monotonically
increasing ID as an auto-incrementing column (with potentially massive gaps
in the IDs) from a relational database.
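
For illustration, here is a minimal sketch (Spark 2.x spark-shell; column
names are arbitrary) that splits the generated value back into its two parts:

import org.apache.spark.sql.functions._

// upper 31 bits = partition index, lower 33 bits = per-partition counter
val ids = spark.range(0, 1000).repartition(4)
  .withColumn("id64", monotonically_increasing_id())
  .withColumn("partitionPart", shiftRightUnsigned(col("id64"), 33))
  .withColumn("recordPart", col("id64").bitwiseAND(lit((1L << 33) - 1)))

ids.show(5)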

Thanks

Mike


On Sun, Sep 4, 2016 at 6:41 PM, Kevin Tran  wrote:

> Hi Mich,
> Thank you for your input.
> Does monotonically_increasing_id() guard against race conditions, and can it
> produce duplicate IDs at some point with multiple threads or multiple instances?
>
> Can even System.currentTimeMillis() still produce duplicates?
>
> Cheers,
> Kevin.
>
> On Mon, Sep 5, 2016 at 12:30 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> You can create a monotonically incrementing ID column on your table
>>
>> scala> val ll_18740868 = spark.table("accounts.ll_18740868")
>> scala> val startval = 1
>> scala> val df = ll_18740868.withColumn("id", monotonically_increasing_id() + startval).show(2)
>> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
>> |transactiondate|transactiontype| sortcode|accountnumber|transactiondescription|debitamount|creditamount|balance| id|
>> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
>> |     2011-12-30|            DEB|'30-64-72|     18740868|  WWW.GFT.COM CD 4628 |       50.0|        null| 304.89|  1|
>> |     2011-12-30|            DEB|'30-64-72|     18740868|  TDA.CONFECC.D.FRE...|      19.01|        null| 354.89|  2|
>> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
>>
>>
>> Now you have a new ID column
>>
>> HTH
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 4 September 2016 at 12:43, Kevin Tran  wrote:
>>
>>> Hi everyone,
>>> Please give me your opinions on what is the best ID Generator for ID
>>> field in parquet ?
>>>
>>> UUID.randomUUID();
>>> AtomicReference<Long> currentTime = new AtomicReference<>(System.currentTimeMillis());
>>> AtomicLong counter = new AtomicLong(0);
>>> 
>>>
>>> Thanks,
>>> Kevin.
>>>
>>>
>>> 
>>> https://issues.apache.org/jira/browse/SPARK-8406 (Race condition when
>>> writing Parquet files)
>>> https://github.com/apache/spark/pull/6864/files
>>>
>>
>>
>


Re: difference between package and jar Option in Spark

2016-09-04 Thread Tal Grynbaum
You need to download all the dependencies of that jar as well
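
For example (a sketch only; the exact artifact names and versions depend on
the spark-csv release you use):

# --packages resolves spark-csv plus its transitive dependencies from Maven
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0

# --jars only ships the jars you list, so the transitive dependencies
# (e.g. commons-csv, univocity-parsers) have to be listed explicitly as well
spark-shell --jars spark-csv_2.10-1.5.0.jar,commons-csv-1.1.jar,univocity-parsers-1.5.1.jar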

On Mon, Sep 5, 2016, 06:59 Divya Gehlot  wrote:

> Hi,
> I am using spark-csv to parse my input files.
> If I use the --packages option it works fine, but if I download
> the jar and use the --jars option
> it throws a ClassNotFoundException.
>
>
> Thanks,
> Divya
>
> On 1 September 2016 at 17:26, Sean Owen  wrote:
>
>> --jars includes a local JAR file in the application's classpath.
>> --packages references Maven coordinates of a dependency, retrieves all of
>> those JAR files (including transitive dependencies), and includes them in
>> the app classpath.
>>
>> On Thu, Sep 1, 2016 at 10:24 AM, Divya Gehlot 
>> wrote:
>> > Hi,
>> >
>> > Would like to know the difference between the --packages and --jars
>> > options in Spark.
>> >
>> >
>> >
>> > Thanks,
>> > Divya
>>
>
>


Re: difference between package and jar Option in Spark

2016-09-04 Thread Divya Gehlot
Hi,
I am using spark-csv to parse my input files.
If I use the --packages option it works fine, but if I download
the jar and use the --jars option
it throws a ClassNotFoundException.


Thanks,
Divya

On 1 September 2016 at 17:26, Sean Owen  wrote:

> --jars includes a local JAR file in the application's classpath.
> --packages references Maven coordinates of a dependency, retrieves all of
> those JAR files (including transitive dependencies), and includes them in
> the app classpath.
>
> On Thu, Sep 1, 2016 at 10:24 AM, Divya Gehlot 
> wrote:
> > Hi,
> >
> > Would like to know the difference between the --packages and --jars
> > options in Spark.
> >
> >
> >
> > Thanks,
> > Divya
>


Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-04 Thread Holden Karau
You really shouldn't mix different versions of Spark between the master and
worker nodes; if you're going to upgrade, upgrade all of them. Otherwise you
may get very confusing failures.

On Monday, September 5, 2016, Rex X  wrote:

> Wish to use the pivot table feature of DataFrames, which is available since
> Spark 1.6. But the Spark version of the current cluster is 1.5. Can we install
> Spark 2.0 on the master node to work around this?
>
> Thanks!
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread 刘虓
Hi,
I think you can refer to the Spark history server to figure out where the
time was spent.
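
If it is not already enabled, the history server typically needs something
like the following (a sketch; the event-log path below is a placeholder):

# conf/spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-events
spark.history.fs.logDirectory    hdfs:///spark-events

# then start it with sbin/start-history-server.sh (UI on port 18080 by default)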

2016-09-05 10:36 GMT+08:00 xiefeng :

> The spark context will be reused, so the spark context initialization won't
> affect the throughput test.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Why-does-spark-take-so-much-time-
> for-simple-task-without-calculation-tp27628p27657.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread xiefeng
The spark context will be reused, so the spark context initialization won't
affect the throughput test.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628p27657.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread xiefeng
My detailed test process:
1.   During initialization, it creates 100 string RDDs and distributes
them across the Spark workers:

for (int i = 1; i <= numOfRDDs; i++) {
    // one tiny single-partition RDD per key
    JavaRDD<String> rddData =
        sc.parallelize(Arrays.asList(Integer.toString(i))).coalesce(1);
    rddData.cache().count();
    simpleRDDs.put(Integer.toString(i), rddData);
}

2.   In JMeter, configure 100 threads looping 100 times; each thread sends
a GET request using its number as the RDD id.

3.   This function simply returns the RDD's string; note that the dictionary
simpleRDDs is initialized up front with the 100 RDDs:

public static String simpleRDDTest(String keyOfRDD) {
    JavaRDD<String> rddData = simpleRDDs.get(keyOfRDD);
    return rddData.first();
}

4.   I tested three cases with different numbers of workers. During the test
I ran several times to get a stable throughput. The throughput in the three
cases varies between 85-95/sec; there is no significant difference between
the different worker counts.
5.   I think this result means that even if there is no calculation, the
throughput has a limit because of Spark job initialization and dispatch.
Adding more workers can't help improve this situation. Can anyone explain
this?
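
One way to separate that per-job scheduling overhead from the computation
itself is to time trivial jobs from the driver; a rough sketch in Scala
(assuming an existing SparkContext sc):

val tiny = sc.parallelize(Seq(1)).cache()
tiny.count()  // materialise the cache once

val latenciesMs = (1 to 100).map { _ =>
  val t0 = System.nanoTime()
  tiny.count()  // one job with essentially no work
  (System.nanoTime() - t0) / 1e6
}
println(f"median per-job latency: ${latenciesMs.sorted.apply(latenciesMs.size / 2)}%.2f ms")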




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628p27656.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread Xie, Feng
Hi Aliaksandr,

Thank you very much for your answer.
In my test I reuse the Spark context: it is initialized when I start the
application, so for the later throughput test it is not initialized again.
And when I increase the number of workers, the throughput doesn't increase.
I read the link you posted, but it only describes command-line tools that are
faster than a Hadoop cluster; I didn't really get the key point that explains
my question.
If Spark context initialization doesn't affect my test case, is there anything
else? Does job initialization or dispatch take time? Thank you!


-Original Message-
From: Bedrytski Aliaksandr [mailto:sp...@bedryt.ski] 
Sent: Wednesday, August 31, 2016 8:45 PM
To: Xie, Feng
Cc: user@spark.apache.org
Subject: Re: Why does spark take so much time for simple task without 
calculation?

Hi xiefeng,

Spark Context initialization takes some time and the tool does not really shine 
for small data computations:
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

But, when working with terabytes (petabytes) of data, those 35 seconds of 
initialization don't really matter. 

Regards,

--
  Bedrytski Aliaksandr
  sp...@bedryt.ski

On Wed, Aug 31, 2016, at 11:45, xiefeng wrote:
> I installed Spark standalone and run the cluster (one master and one
> worker) on a Windows 2008 server with 16 cores and 24 GB memory.
> 
> I have done a simple test: just create a string RDD and simply return
> it. I use JMeter to test throughput, but the highest is around 35/sec.
> I think Spark is powerful at distributed calculation, but why is the
> throughput so limited in such a simple test scenario that contains only
> simple task dispatch and no calculation?
> 
> 1. In JMeter I tested both 10 threads and 100 threads; there is little
> difference, around 2-3/sec.
> 2. I tested both caching and not caching the RDD; there is little difference.
> 3. During the test, CPU and memory usage are at a low level.
> 
> Below is my test code:
> @RestController
> public class SimpleTest {   
>   @RequestMapping(value = "/SimpleTest", method = RequestMethod.GET)
>   @ResponseBody
>   public String testProcessTransaction() {
>   return SparkShardTest.simpleRDDTest();
>   }
> }
> 
> final static Map<String, JavaRDD<String>> simpleRDDs = initSimpleRDDs();
> 
> public static Map<String, JavaRDD<String>> initSimpleRDDs()
>   {
>   Map<String, JavaRDD<String>> result = new ConcurrentHashMap<String, JavaRDD<String>>();
>   JavaRDD<String> rddData = JavaSC.parallelize(data);
>   rddData.cache().count(); // this cache will improve 1-2/sec
>   result.put("MyRDD", rddData);
>   return result;
>   }
> 
>   public static String simpleRDDTest()
>   {
>   JavaRDD<String> rddData = simpleRDDs.get("MyRDD");
>   return rddData.first();
>   }
> 
> 
> 
> 
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-tak
> e-so-much-time-for-simple-task-without-calculation-tp27628.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Kevin Tran
Hi Mich,
Thank you for your input.
Does monotonically_increasing_id() guard against race conditions, and can it
produce duplicate IDs at some point with multiple threads or multiple instances?

Can even System.currentTimeMillis() still produce duplicates?

Cheers,
Kevin.

On Mon, Sep 5, 2016 at 12:30 AM, Mich Talebzadeh 
wrote:

> You can create a monotonically incrementing ID column on your table
>
> scala> val ll_18740868 = spark.table("accounts.ll_18740868")
> scala> val startval = 1
> scala> val df = ll_18740868.withColumn("id", monotonically_increasing_id() + startval).show(2)
> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
> |transactiondate|transactiontype| sortcode|accountnumber|transactiondescription|debitamount|creditamount|balance| id|
> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
> |     2011-12-30|            DEB|'30-64-72|     18740868|  WWW.GFT.COM CD 4628 |       50.0|        null| 304.89|  1|
> |     2011-12-30|            DEB|'30-64-72|     18740868|  TDA.CONFECC.D.FRE...|      19.01|        null| 354.89|  2|
> +---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
>
>
> Now you have a new ID column
>
> HTH
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 4 September 2016 at 12:43, Kevin Tran  wrote:
>
>> Hi everyone,
>> Please give me your opinions on what is the best ID Generator for ID
>> field in parquet ?
>>
>> UUID.randomUUID();
>> AtomicReference<Long> currentTime = new AtomicReference<>(System.currentTimeMillis());
>> AtomicLong counter = new AtomicLong(0);
>> 
>>
>> Thanks,
>> Kevin.
>>
>>
>> 
>> https://issues.apache.org/jira/browse/SPARK-8406 (Race condition when
>> writing Parquet files)
>> https://github.com/apache/spark/pull/6864/files
>>
>
>


Re: spark cassandra issue

2016-09-04 Thread Selvam Raman
Hi Russell.

If possible, please help me to solve the issue below.

val df = sqlContext.read.
format("org.apache.spark.sql.cassandra").
options(Map("c_table"->"restt","keyspace"->"sss")).
load()


com.datastax.driver.core.TransportException: [/192.23.2.100:9042] Cannot connect
        at com.datastax.driver.core.Connection.<init>(Connection.java:109)
        at com.datastax.driver.core.PooledConnection.<init>(PooledConnection.java:32)
        at com.datastax.driver.core.Connection$Factory.open(Connection.java:586)
        at com.datastax.driver.core.DynamicConnectionPool.<init>(DynamicConnectionPool.java:74)
        at com.datastax.driver.core.HostConnectionPool.newInstance(HostConnectionPool.java:33)
        at com.datastax.driver.core.SessionManager.replacePool(SessionManager.java:271)
        at com.datastax.driver.core.SessionManager.access$400(SessionManager.java:40)
        at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:308)
        at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:300)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /192.168.2.100:9042
        at org.jboss.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
        ... 3 more
16/09/04 18:37:35 ERROR core.Session: Error creating pool to /192.168.2.74:9042
com.datastax.driver.core.TransportException: [/192.28.2.74:9042] Cannot connect
        at com.datastax.driver.core.Connection.<init>(Connection.java:109)
        at com.datastax.driver.core.PooledConnection.<init>(PooledConnection.java:32)
        at com.datastax.driver.core.Connection$Factory.open(Connection.java:586)
        at com.datastax.driver.core.DynamicConnectionPool.<init>(DynamicConnectionPool.java:74)
        at com.datastax.driver.core.HostConnectionPool.newInstance(HostConnectionPool.java:33)
        at com.datastax.driver.core.SessionManager.replacePool(SessionManager.java:271)
        at com.datastax.driver.core.SessionManager.access$400(SessionManager.java:40)
        at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:308)
        at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:300)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /192.168.2.74:9042
        at org.jboss.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
        at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
        ... 3 more
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/types/PrimitiveType
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
 

Problem in accessing swebhdfs

2016-09-04 Thread Sourav Mazumder
Hi,

When I try to access a swebhdfs uri I get following error.

In my hadoop cluster webhdfs is enabled.

Also I can access the same resource using webhdfs API from a http client
with SSL.

Any idea what is going wrong?

Regards,
Sourav

java.io.IOException: Unexpected HTTP response: code=404 != 200,
op=GETFILESTATUS, message=Not Found
at
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:347)
at
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:90)
at
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:613)
at
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:463)
at
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:492)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at
org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:488)
at
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:848)
at
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:858)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1674)
at
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
... 50 elided
Caused by: java.io.IOException: Content-Type "text/html;charset=ISO-8859-1"
is incompatible with "application/json"
(parsed="text/html;charset=ISO-8859-1")
at
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.jsonParse(WebHdfsFileSystem.java:320)
at
org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:343)
... 78 more


Reuters Market Data System connection to Spark Streaming

2016-09-04 Thread Mich Talebzadeh
Hi,

Has anyone had experience of using such messaging system like Kafka to
connect Reuters Market Data System to Spark Streaming by any chance.

Thanks



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Resources for learning Spark administration

2016-09-04 Thread Mich Talebzadeh
Hi,

There is a lot of stuff to cover here, depending on the business and your
needs.

Do you mean:


   1. Hardware spec for Spark master and nodes
   2. The number of nodes, How to scale the nodes
   3. Where to set up Spark nodes, on the same Hardware nodes as HDFS
   (assuming using Hadoop) or on the same subnet
   4. The network bandwidth between Spark cluster and Hadoop cluster
   5. The mode of operation: Local, Standalone, YARN (cluster and client),
   etc.
   6. General Spark admin, job and log monitoring
   7. Security admin
   8. Setting up and configure Spark Thrift Servers (STS) and how many
   (multiple STSs on the same host, different nodes etc)
   9. Host of other matters

Spark online documents are a good way to start.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 September 2016 at 19:34, Somasundaram Sekar <
somasundar.se...@tigeranalytics.com> wrote:

> Please suggest some good resources to learn Spark administration.
>


Resources for learning Spark administration

2016-09-04 Thread Somasundaram Sekar
Please suggest some good resources to learn Spark administration.


Re: S3A + EMR failure when writing Parquet?

2016-09-04 Thread Everett Anderson
Hey,

Thanks for the reply and sorry for the late response!

I haven't been able to figure out the root cause, but I have been able to
get things working if both the cluster and the remote submitter use S3A
instead of EMRFS for all s3:// interactions, so I'm going with that, for
now.

My impression from reading your various other replies on S3A is that it's
also best to use mapreduce.fileoutputcommitter.algorithm.version=2 (which might
someday be the default) and, presumably if your data fits well in memory, use
fs.s3a.fast.upload=true. Is that right?



On Tue, Aug 30, 2016 at 11:49 AM, Steve Loughran 
wrote:

>
> On 29 Aug 2016, at 18:18, Everett Anderson  > wrote:
>
> Okay, I don't think it's really just S3A issue, anymore. I can run the job
> using fs.s3.impl/spark.hadoop.fs.s3.impl set to the S3A impl as a --conf
> param from the EMR console successfully, as well.
>
> The problem seems related to the fact that we're trying to spark-submit
> jobs to a YARN cluster from outside the cluster itself.
>
> The docs 
> suggest one must copy the Hadoop/YARN config XML outside of the cluster to
> do this, which feels gross, but it's what we did. We had changed fs.s3.impl
> to use S3A in that config, and that seems to result in the failure, though
> I still can't figure out why.
>
> Interestingly, if I don't make that change to the XML, and leave it as the
> EMRFS implementation, it will work, as long as I use s3a:// URIs for the
> jar, otherwise spark-submit won't be able to ship them to the cluster since
> it won't have the EMRFS implementation locally.
>
>
> I see: you are trying to use EMR's "special" S3 in-cluster, but
> spark-submit is trying to submit remotely.
>
> 1.  Trying to change the value of fs.s3.impl to S3a works for upload, but
> not runtime
> 2. use s3a for the upload, leave things alone and all works.
>
> I would just go with S3a, this is just the JARs being discussed here right
> —not the actual data?
>
> When the JARs are needed, they'll be copied on EMR using the amazon S3A
> implementation —whatever they've done there— to the local filesystem, where
> classloaders can pick them up and use. It might be that s3a:// URLs are
> slower on EMR than s3:// URLs, but there's no fundamental reason why it
> isn't going to work.
>
>
>
>
>
> On Sun, Aug 28, 2016 at 4:19 PM, Everett Anderson 
> wrote:
>
>> (Sorry, typo -- I was using spark.hadoop.mapreduce.f
>> ileoutputcommitter.algorithm.version=2 not 'hadooop', of course)
>>
>> On Sun, Aug 28, 2016 at 12:51 PM, Everett Anderson 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm having some trouble figuring out a failure when using S3A when
>>> writing a DataFrame as Parquet on EMR 4.7.2 (which is Hadoop 2.7.2 and
>>> Spark 1.6.2). It works when using EMRFS (s3://), though.
>>>
>>> I'm using these extra conf params, though I've also tried without
>>> everything but the encryption one with the same result:
>>>
>>> --conf spark.hadooop.mapreduce.fileoutputcommitter.algorithm.version=2
>>> --conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true
>>> --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256
>>>
>> --conf spark.sql.parquet.output.committer.class=org.apache.spark.sq
>>> l.parquet.DirectParquetOutputCommitter
>>>
>>> It looks like it does actually write the parquet shards under
>>>
>>> /_temporary/0/_temporary//
>>>
>>> but then must hit that S3 exception when trying to copy/rename. I think
>>> the NullPointerException deep down in Parquet is due to it causing close()
>>> more than once so isn't the root cause, but I'm not sure.
>>>
>>
> given the stack trace has abortTask() in it, I'd suspect that's a
> follow-on failure.
>
>
>
> One possibility here may be related to how EMR will handle your
> credentials (session credentials served up over IAM HTTP) and how Apache
> Hadoop 2.7's s3a auth works (IAM isn't supported until 2.8). That could
> trigger the problem. But I don't know.
>
> I do know that I have dataframes writing back to s3a on Hadoop 2.7.3, *not
> on EMR*.
>
>
>
>>> Anyone seen something like this?
>>>
>>> 16/08/28 19:46:28 ERROR InsertIntoHadoopFsRelation: Aborting job.
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 
>>> in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in stage 
>>> 1.0 (TID 54, 
>>> ip-10-8-38-103.us-west-2.computk.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:266)
>>> ... 8 more
>>> Suppressed: java.lang.NullPointerException
>>> at 
>>> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:147)
>>> at 
>>> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>>> 

Re: spark cassandra issue

2016-09-04 Thread Russell Spitzer
This would also be a better question for the SCC user list :)
https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user

On Sun, Sep 4, 2016 at 9:31 AM Russell Spitzer 
wrote:

>
> https://github.com/datastax/spark-cassandra-connector/blob/v1.3.1/doc/14_data_frames.md
> In Spark 1.3 it was illegal to use "table" as a key in Spark SQL so in
> that version of Spark the connector needed to use the option "c_table"
>
>
> val df = sqlContext.read.
>  | format("org.apache.spark.sql.cassandra").
>  | options(Map( "c_table" -> "", "keyspace" -> "***")).
>  | load()
>
>
> On Sun, Sep 4, 2016 at 8:32 AM Mich Talebzadeh 
> wrote:
>
>> and your Cassandra table is there etc?
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 4 September 2016 at 16:20, Selvam Raman  wrote:
>>
>>> Hey Mich,
>>>
>>> I am using the same one right now. Thanks for the reply.
>>> import org.apache.spark.sql.cassandra._
>>> import com.datastax.spark.connector._ //Loads implicit functions
>>> sc.cassandraTable("keyspace name", "table name")
>>>
>>>
>>> On Sun, Sep 4, 2016 at 8:48 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Selvan.

 I don't deal with Cassandra but have you tried other options as
 described here


 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md

 To get a Spark RDD that represents a Cassandra table, call the
 cassandraTable method on the SparkContext object.

 import com.datastax.spark.connector._ //Loads implicit functions
 sc.cassandraTable("keyspace name", "table name")



 HTH


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 4 September 2016 at 15:52, Selvam Raman  wrote:

> its very urgent. please help me guys.
>
> On Sun, Sep 4, 2016 at 8:05 PM, Selvam Raman  wrote:
>
>> Please help me to solve the issue.
>>
>> spark-shell --packages
>> com.datastax.spark:spark-cassandra-connector_2.10:1.3.0 --conf
>> spark.cassandra.connection.host=**
>>
>> val df = sqlContext.read.
>>  | format("org.apache.spark.sql.cassandra").
>>  | options(Map( "table" -> "", "keyspace" -> "***")).
>>  | load()
>> java.util.NoSuchElementException: key not found: c_table
>> at scala.collection.MapLike$class.default(MapLike.scala:228)
>> at
>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:151)
>> at scala.collection.MapLike$class.apply(MapLike.scala:141)
>> at
>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:151)
>> at
>> org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOptions(DefaultSource.scala:120)
>> at
>> org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:56)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
>> a
>>
>> --
>> Selvam Raman
>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>
>
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


>>>
>>>
>>> --
>>> Selvam Raman
>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>>
>>
>>


Re: spark cassandra issue

2016-09-04 Thread Russell Spitzer
https://github.com/datastax/spark-cassandra-connector/blob/v1.3.1/doc/14_data_frames.md
In Spark 1.3 it was illegal to use "table" as a key in Spark SQL so in that
version of Spark the connector needed to use the option "c_table"

val df = sqlContext.read.
 | format("org.apache.spark.sql.cassandra").
 | options(Map( "c_table" -> "", "keyspace" -> "***")).
 | load()


On Sun, Sep 4, 2016 at 8:32 AM Mich Talebzadeh 
wrote:

> and your Cassandra table is there etc?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 4 September 2016 at 16:20, Selvam Raman  wrote:
>
>> Hey Mich,
>>
>> I am using the same one right now. Thanks for the reply.
>> import org.apache.spark.sql.cassandra._
>> import com.datastax.spark.connector._ //Loads implicit functions
>> sc.cassandraTable("keyspace name", "table name")
>>
>>
>> On Sun, Sep 4, 2016 at 8:48 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Selvan.
>>>
>>> I don't deal with Cassandra but have you tried other options as
>>> described here
>>>
>>>
>>> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
>>>
>>> To get a Spark RDD that represents a Cassandra table, call the
>>> cassandraTable method on the SparkContext object.
>>>
>>> import com.datastax.spark.connector._ //Loads implicit functions
>>> sc.cassandraTable("keyspace name", "table name")
>>>
>>>
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 4 September 2016 at 15:52, Selvam Raman  wrote:
>>>
 its very urgent. please help me guys.

 On Sun, Sep 4, 2016 at 8:05 PM, Selvam Raman  wrote:

> Please help me to solve the issue.
>
> spark-shell --packages
> com.datastax.spark:spark-cassandra-connector_2.10:1.3.0 --conf
> spark.cassandra.connection.host=**
>
> val df = sqlContext.read.
>  | format("org.apache.spark.sql.cassandra").
>  | options(Map( "table" -> "", "keyspace" -> "***")).
>  | load()
> java.util.NoSuchElementException: key not found: c_table
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:151)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:151)
> at
> org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOptions(DefaultSource.scala:120)
> at
> org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:56)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
> a
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>



 --
 Selvam Raman
 "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"

>>>
>>>
>>
>>
>> --
>> Selvam Raman
>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>
>
>


Re: Spark transformations

2016-09-04 Thread janardhan shetty
In scala Spark ML Dataframes.

On Sun, Sep 4, 2016 at 9:16 AM, Somasundaram Sekar <
somasundar.se...@tigeranalytics.com> wrote:

> Can you try this
>
> https://www.linkedin.com/pulse/hive-functions-udfudaf-
> udtf-examples-gaurav-singh
>
> On 4 Sep 2016 9:38 pm, "janardhan shetty"  wrote:
>
>> Hi,
>>
>> Is there any chance that we can send multiple entire columns to a UDF
>> and generate a new column for Spark ML?
>> I see a similar approach in VectorAssembler, but I am not able to use a few
>> classes/traits like HasInputCols, HasOutputCol, DefaultParamsWritable since
>> they are private.
>>
>> Any leads/examples is appreciated in this regard..
>>
>> Requirement:
>> *Input*: Multiple columns of a Dataframe
>> *Output*:  Single new modified column
>>
>


Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-04 Thread Rex X
Wish to use the pivot table feature of DataFrames, which is available since
Spark 1.6. But the Spark version of the current cluster is 1.5. Can we install
Spark 2.0 on the master node to work around this?

Thanks!


Re: Spark transformations

2016-09-04 Thread Somasundaram Sekar
Can you try this

https://www.linkedin.com/pulse/hive-functions-udfudaf-udtf-examples-gaurav-singh

On 4 Sep 2016 9:38 pm, "janardhan shetty"  wrote:

> Hi,
>
> Is there any chance that we can send multiple entire columns to a UDF and
> generate a new column for Spark ML?
> I see a similar approach in VectorAssembler, but I am not able to use a few
> classes/traits like HasInputCols, HasOutputCol, DefaultParamsWritable since
> they are private.
>
> Any leads/examples is appreciated in this regard..
>
> Requirement:
> *Input*: Multiple columns of a Dataframe
> *Output*:  Single new modified column
>


Spark transformations

2016-09-04 Thread janardhan shetty
Hi,

Is there any chance that we can send multiple entire columns to a UDF and
generate a new column for Spark ML?
I see a similar approach in VectorAssembler, but I am not able to use a few
classes/traits like HasInputCols, HasOutputCol, DefaultParamsWritable since
they are private.

Any leads/examples is appreciated in this regard..

Requirement:
*Input*: Multiple columns of a Dataframe
*Output*:  Single new modified column


Re: spark cassandra issue

2016-09-04 Thread Mich Talebzadeh
and your Cassandra table is there etc?



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 September 2016 at 16:20, Selvam Raman  wrote:

> Hey Mich,
>
> I am using the same one right now. Thanks for the reply.
> import org.apache.spark.sql.cassandra._
> import com.datastax.spark.connector._ //Loads implicit functions
> sc.cassandraTable("keyspace name", "table name")
>
>
> On Sun, Sep 4, 2016 at 8:48 PM, Mich Talebzadeh  > wrote:
>
>> Hi Selvan.
>>
>> I don't deal with Cassandra but have you tried other options as described
>> here
>>
>> https://github.com/datastax/spark-cassandra-connector/blob/
>> master/doc/2_loading.md
>>
>> To get a Spark RDD that represents a Cassandra table, call the
>> cassandraTable method on the SparkContext object.
>>
>> import com.datastax.spark.connector._ //Loads implicit functions
>> sc.cassandraTable("keyspace name", "table name")
>>
>>
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 4 September 2016 at 15:52, Selvam Raman  wrote:
>>
>>> its very urgent. please help me guys.
>>>
>>> On Sun, Sep 4, 2016 at 8:05 PM, Selvam Raman  wrote:
>>>
 Please help me to solve the issue.

 spark-shell --packages 
 com.datastax.spark:spark-cassandra-connector_2.10:1.3.0
 --conf spark.cassandra.connection.host=**

 val df = sqlContext.read.
  | format("org.apache.spark.sql.cassandra").
  | options(Map( "table" -> "", "keyspace" -> "***")).
  | load()
 java.util.NoSuchElementException: key not found: c_table
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at org.apache.spark.sql.execution.datasources.CaseInsensitiveMa
 p.default(ddl.scala:151)
 at scala.collection.MapLike$class.apply(MapLike.scala:141)
 at org.apache.spark.sql.execution.datasources.CaseInsensitiveMa
 p.apply(ddl.scala:151)
 at org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOpt
 ions(DefaultSource.scala:120)
 at org.apache.spark.sql.cassandra.DefaultSource.createRelation(
 DefaultSource.scala:56)
 at org.apache.spark.sql.execution.datasources.ResolvedDataSourc
 e$.apply(ResolvedDataSource.scala:125)
 a

 --
 Selvam Raman
 "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"

>>>
>>>
>>>
>>> --
>>> Selvam Raman
>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>>
>>
>>
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: spark cassandra issue

2016-09-04 Thread Selvam Raman
Hey Mich,

I am using the same one right now. Thanks for the reply.
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector._ //Loads implicit functions
sc.cassandraTable("keyspace name", "table name")


On Sun, Sep 4, 2016 at 8:48 PM, Mich Talebzadeh 
wrote:

> Hi Selvan.
>
> I don't deal with Cassandra but have you tried other options as described
> here
>
> https://github.com/datastax/spark-cassandra-connector/
> blob/master/doc/2_loading.md
>
> To get a Spark RDD that represents a Cassandra table, call the
> cassandraTable method on the SparkContext object.
>
> import com.datastax.spark.connector._ //Loads implicit functions
> sc.cassandraTable("keyspace name", "table name")
>
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 4 September 2016 at 15:52, Selvam Raman  wrote:
>
>> its very urgent. please help me guys.
>>
>> On Sun, Sep 4, 2016 at 8:05 PM, Selvam Raman  wrote:
>>
>>> Please help me to solve the issue.
>>>
>>> spark-shell --packages 
>>> com.datastax.spark:spark-cassandra-connector_2.10:1.3.0
>>> --conf spark.cassandra.connection.host=**
>>>
>>> val df = sqlContext.read.
>>>  | format("org.apache.spark.sql.cassandra").
>>>  | options(Map( "table" -> "", "keyspace" -> "***")).
>>>  | load()
>>> java.util.NoSuchElementException: key not found: c_table
>>> at scala.collection.MapLike$class.default(MapLike.scala:228)
>>> at org.apache.spark.sql.execution.datasources.CaseInsensitiveMa
>>> p.default(ddl.scala:151)
>>> at scala.collection.MapLike$class.apply(MapLike.scala:141)
>>> at org.apache.spark.sql.execution.datasources.CaseInsensitiveMa
>>> p.apply(ddl.scala:151)
>>> at org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOpt
>>> ions(DefaultSource.scala:120)
>>> at org.apache.spark.sql.cassandra.DefaultSource.createRelation(
>>> DefaultSource.scala:56)
>>> at org.apache.spark.sql.execution.datasources.ResolvedDataSourc
>>> e$.apply(ResolvedDataSource.scala:125)
>>> a
>>>
>>> --
>>> Selvam Raman
>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>>
>>
>>
>>
>> --
>> Selvam Raman
>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>
>
>


-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: spark cassandra issue

2016-09-04 Thread Mich Talebzadeh
Hi Selvan.

I don't deal with Cassandra but have you tried other options as described
here

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md

To get a Spark RDD that represents a Cassandra table, call the
cassandraTable method on the SparkContext object.

import com.datastax.spark.connector._ //Loads implicit functions
sc.cassandraTable("keyspace name", "table name")



HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 September 2016 at 15:52, Selvam Raman  wrote:

> its very urgent. please help me guys.
>
> On Sun, Sep 4, 2016 at 8:05 PM, Selvam Raman  wrote:
>
>> Please help me to solve the issue.
>>
>> spark-shell --packages 
>> com.datastax.spark:spark-cassandra-connector_2.10:1.3.0
>> --conf spark.cassandra.connection.host=**
>>
>> val df = sqlContext.read.
>>  | format("org.apache.spark.sql.cassandra").
>>  | options(Map( "table" -> "", "keyspace" -> "***")).
>>  | load()
>> java.util.NoSuchElementException: key not found: c_table
>> at scala.collection.MapLike$class.default(MapLike.scala:228)
>> at org.apache.spark.sql.execution.datasources.CaseInsensitiveMa
>> p.default(ddl.scala:151)
>> at scala.collection.MapLike$class.apply(MapLike.scala:141)
>> at org.apache.spark.sql.execution.datasources.CaseInsensitiveMa
>> p.apply(ddl.scala:151)
>> at org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOpt
>> ions(DefaultSource.scala:120)
>> at org.apache.spark.sql.cassandra.DefaultSource.createRelation(
>> DefaultSource.scala:56)
>> at org.apache.spark.sql.execution.datasources.ResolvedDataSourc
>> e$.apply(ResolvedDataSource.scala:125)
>> a
>>
>> --
>> Selvam Raman
>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>
>
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: spark cassandra issue

2016-09-04 Thread Selvam Raman
its very urgent. please help me guys.

On Sun, Sep 4, 2016 at 8:05 PM, Selvam Raman  wrote:

> Please help me to solve the issue.
>
> spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.3.0
> --conf spark.cassandra.connection.host=**
>
> val df = sqlContext.read.
>  | format("org.apache.spark.sql.cassandra").
>  | options(Map( "table" -> "", "keyspace" -> "***")).
>  | load()
> java.util.NoSuchElementException: key not found: c_table
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at org.apache.spark.sql.execution.datasources.
> CaseInsensitiveMap.default(ddl.scala:151)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at org.apache.spark.sql.execution.datasources.
> CaseInsensitiveMap.apply(ddl.scala:151)
> at org.apache.spark.sql.cassandra.DefaultSource$.
> TableRefAndOptions(DefaultSource.scala:120)
> at org.apache.spark.sql.cassandra.DefaultSource.
> createRelation(DefaultSource.scala:56)
> at org.apache.spark.sql.execution.datasources.
> ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
> a
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>



-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Generating random Data using Spark and saving it to table, views appreciated

2016-09-04 Thread Mich Talebzadeh
Hi All,

The following code creates an array of rows in Spark and saves the output
into a Hive ORC table. You can save it in whatever format you prefer.
I wanted to create generic test data in Spark. It is not something standard,
but I had a similar approach for Oracle.
It is cooked-up stuff and not necessarily clever. It creates columns of data
using primitive types. Anyone is welcome to take it, modify it and let me
know if it can be bettered.
It also checks whether the target ORC table already exists. It uses a
monotonically increasing ID for the rows, and if data is already there, it
starts from MAX(ID) + 1.

It would be interesting if one could add complex columns to it as well.

Appreciate any comments or help
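
As a starting point for the complex-columns idea above, something like this
could be tacked onto the generated DataFrame (the df built below); a sketch
only, and the Hive DDL would need matching array/struct columns:

import org.apache.spark.sql.functions._
val dfWithComplex = df
  .withColumn("tags", array(lit("a"), lit("b")))                // array<string> column
  .withColumn("id_and_vc", struct(col("id"), col("small_vc")))  // struct column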

The code attached as well

class UsedFunctions {
  import scala.util.Random
  import scala.math._
  import org.apache.spark.sql.functions._
  def randomString(chars: String, length: Int): String =
 (0 until length).map(_ => chars(Random.nextInt(chars.length))).mkString
  def clustered(id : Int, numRows: Int) : Double  = (id - 1).floor/numRows
  def scattered(id : Int, numRows: Int) : Double  = (id - 1 % numRows).abs
  def randomised(seed: Int, numRows: Int) : Double  = (Random.nextInt(seed)
% numRows).abs
  def padString(id: Int, chars: String, length: Int): String =
 (0 until length).map(_ => chars(Random.nextInt(chars.length))).mkString
+ id.toString
  def padSingleChar(chars: String, length: Int): String =
 (0 until length).map(_ => chars(Random.nextInt(chars.length))).mkString
}
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
println ("\nStarted at"); HiveContext.sql("SELECT
FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
").collect.foreach(println)
//spark.udf.register("randomString", randomString(_:String, _:Int))
case class columns (
   id: Int
 , clustered: Double
 , scattered: Double
 , randomised: Double
 , random_string: String
 , small_vc: String
 , padding: String
 , padding2: String
   )
//val chars = ('a' to 'z') ++ ('A' to 'Z') ++ ('0' to '9') ++ ("-!£$")
val chars = ('a' to 'z') ++ ('A' to 'Z')
//
//get the max(ID) from test.dummy2
//
val numRows = 6   // do in increments of 50K rows otherwise you blow up driver memory!
val rows = HiveContext.sql("SELECT COUNT(1) FROM test.dummy2").collect.apply(0).getLong(0)
var start = 0
if (rows == 0) {
  start = 1
} else {
  val maxID = HiveContext.sql("SELECT MAX(id) FROM test.dummy2").collect.apply(0).getInt(0)
  start = maxID + 1
}
val end = start + numRows - 1
println (" starting at ID = " + start + " , ending on = " +  end )
val UsedFunctions = new UsedFunctions
val text = ( start to end ).map(i =>
 (
 i.toString
   , UsedFunctions.clustered(i,numRows).toString
   , UsedFunctions.scattered(i,numRows).toString
   , UsedFunctions.randomised(i,numRows).toString
   , UsedFunctions.randomString(chars.mkString(""),50)
   , UsedFunctions.padString(i, " ", 10)
   , UsedFunctions.padSingleChar("x ", 4000)
   , UsedFunctions.padSingleChar("y ", 4000)
 )
   ).
toArray
val df = sc.parallelize(text).
  map(p => columns(
  p._1.toString.toInt
, p._2.toString.toDouble
, p._3.toString.toDouble
, p._4.toString.toDouble
, p._5.toString
, p._6.toString
, p._7.toString
, p._8.toString
  )
 ).
toDF
//
// register DF as tempTable
//
df.registerTempTable("tmp")
// Need to create and populate target ORC table test.dummy2
//
//HiveContext.sql("""DROP TABLE IF EXISTS test.dummy2""")
  var sqltext  = ""
  sqltext = """
  CREATE TABLE if not exists test.dummy2(
 ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(4000)
   , PADDING2 VARCHAR(4000)
  )
  --CLUSTERED BY (ID) INTO 256 BUCKETS
  STORED AS ORC
  TBLPROPERTIES (
  "orc.create.index"="true",
  "orc.bloom.filter.columns"="ID",
  "orc.bloom.filter.fpp"="0.05",
  "orc.compress"="SNAPPY",
  "orc.stripe.size"="16777216",
  "orc.row.index.stride"="1" )
  """
   HiveContext.sql(sqltext)
  //
  // Put data in Hive table. Clean up is already done
  //
  sqltext = """
  INSERT INTO TABLE test.dummy2
  SELECT
  ID
, CLUSTERED
, SCATTERED
, RANDOMISED
, 

spark cassandra issue

2016-09-04 Thread Selvam Raman
Please help me to solve the issue.

spark-shell --packages
com.datastax.spark:spark-cassandra-connector_2.10:1.3.0 --conf
spark.cassandra.connection.host=**

val df = sqlContext.read.
 | format("org.apache.spark.sql.cassandra").
 | options(Map( "table" -> "", "keyspace" -> "***")).
 | load()
java.util.NoSuchElementException: key not found: c_table
at scala.collection.MapLike$class.default(MapLike.scala:228)
at
org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:151)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at
org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:151)
at
org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOptions(DefaultSource.scala:120)
at
org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:56)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
a

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Mich Talebzadeh
You can create a monotonically incrementing ID column on your table

scala> val ll_18740868 = spark.table("accounts.ll_18740868")
scala> val startval = 1
scala> val df = ll_18740868.withColumn("id", monotonically_increasing_id() + startval).show(2)
+---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
|transactiondate|transactiontype| sortcode|accountnumber|transactiondescription|debitamount|creditamount|balance| id|
+---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+
|     2011-12-30|            DEB|'30-64-72|     18740868|  WWW.GFT.COM CD 4628 |       50.0|        null| 304.89|  1|
|     2011-12-30|            DEB|'30-64-72|     18740868|  TDA.CONFECC.D.FRE...|      19.01|        null| 354.89|  2|
+---------------+---------------+---------+-------------+----------------------+-----------+------------+-------+---+


Now you have a new ID column

HTH






Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 September 2016 at 12:43, Kevin Tran  wrote:

> Hi everyone,
> Please give me your opinions on what is the best ID Generator for ID field
> in parquet ?
>
> UUID.randomUUID();
> AtomicReference<Long> currentTime = new AtomicReference<>(System.currentTimeMillis());
> AtomicLong counter = new AtomicLong(0);
> 
>
> Thanks,
> Kevin.
>
>
> 
> https://issues.apache.org/jira/browse/SPARK-8406 (Race condition when
> writing Parquet files)
> https://github.com/apache/spark/pull/6864/files
>


Best ID Generator for ID field in parquet ?

2016-09-04 Thread Kevin Tran
Hi everyone,
Please give me your opinions on what is the best ID Generator for ID field
in parquet ?

UUID.randomUUID();
AtomicReference<Long> currentTime = new AtomicReference<>(System.currentTimeMillis());
AtomicLong counter = new AtomicLong(0);


Thanks,
Kevin.



https://issues.apache.org/jira/browse/SPARK-8406 (Race condition when
writing Parquet files)
https://github.com/apache/spark/pull/6864/files


How does chaining of Windowed Dstreams work?

2016-09-04 Thread Hemalatha A
Hello,

I have a set of DStreams, and I'm performing some computation on each
DStream, which is windowed on another one derived from the base stream,
according to the order of the window intervals. I want to find out the best
stream on which to window a particular stream.

Suppose I have a Spark DStream with a batch interval of 10 sec, and other
streams are windowed on the base stream as below:

Stream   | Window | Sliding | Windowed On
StreamA  | 30     | 10      | Base Stream
StreamB  | 20     | 20      | Base Stream
StreamC  | 90     | 20      | ?



Now, should I base StreamC on StreamA, since its window is a multiple of
StreamA's, or base it on StreamB, since it has a higher (and the same)
sliding interval? Which would be a better choice?


Or is it the same as windowing on the base stream? How does it basically work?
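
For reference, the layout above could be wired up roughly like this (a sketch
with a 10-second batch interval and a made-up socket source; note that a
windowed DStream's window and slide durations must be multiples of its
parent's slide duration):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val base = ssc.socketTextStream("localhost", 9999)  // placeholder source

val streamA = base.window(Seconds(30), Seconds(10))
val streamB = base.window(Seconds(20), Seconds(20))

// StreamC windowed directly on the base stream (slide 10s): 90 and 20 are both multiples of 10.
// Windowing it on StreamA (slide 10s) would also satisfy that rule, whereas on
// StreamB (slide 20s) a 90s window would not.
val streamC = base.window(Seconds(90), Seconds(20))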


-- 


Regards
Hemalatha


Creating a UDF/UDAF using code generation

2016-09-04 Thread AssafMendelson
Hi,
I want to write a UDF/UDAF which provides native processing performance. 
Currently, when creating a UDF/UDAF in the normal manner, performance takes a hit 
because it breaks optimizations.
I tried something like this:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.util.TypeUtils
import org.apache.spark.sql.types._
import org.apache.spark.util.Utils
import org.apache.spark.sql.catalyst.expressions._

case class genf(child: Expression) extends UnaryExpression with Predicate with ImplicitCastInputTypes {

  override def inputTypes: Seq[AbstractDataType] = Seq(IntegerType)

  override def toString: String = s"$child < 10"

  override def eval(input: InternalRow): Any = {
    val value = child.eval(input)
    if (value == null) {
      false
    } else {
      child.dataType match {
        case IntegerType => value.asInstanceOf[Int] < 10
      }
    }
  }

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    defineCodeGen(ctx, ev, c => s"($c) < 10")
  }
}


However, this doesn't work, as some of the underlying classes/traits are private (e.g. AbstractDataType is private), which makes it problematic to create a new case class outside the Spark source tree.
Is there a way to do it? The idea is to provide a couple of jars with a bunch of functions our team needs.
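For reference, assuming the expression above did compile (for example by placing it in a package under org.apache.spark.sql so the package-private types are visible), it could be exposed as a regular Column function along these lines:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// hypothetical wrapper: expose the catalyst expression as a Column-based function
def isBelowTen(c: Column): Column = new Column(genf(c.expr))

// usage sketch in spark-shell (where `spark` and its implicits are available)
import spark.implicits._
val df = (1 to 100).toDF("x")            // IntegerType column
df.filter(isBelowTen(col("x"))).show()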
Thanks,
Assaf.





RE: Scala Vs Python

2016-09-04 Thread AssafMendelson
I don’t have anything off the hand (Unfortunately I didn’t really save it) but 
you can easily make some toy examples.
For example you might do something like defining a simple UDF (e.g. test if 
number < 10)
Then create the function in scala:

package com.example
import org.apache.spark.sql.functions.udf

object udfObj extends Serializable {
  def createUDF = {
udf((x: Int) => x < 10)
  }
}

Compile the Scala code and run pyspark with --jars and --driver-class-path pointing to the created jar.
Inside pyspark do something like:


from py4j.java_gateway import java_import
from pyspark.sql.column import Column, _to_java_column, _to_seq
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
import time

jvm = sc._gateway.jvm
java_import(jvm, "com.example")

def udf_scala(col):
    # wrap the PySpark Column into a JVM column sequence before applying the Scala UDF
    return Column(jvm.com.example.udfObj.createUDF().apply(_to_seq(sc, [col], _to_java_column)))

udf_python = udf(lambda x: x < 10, BooleanType())

df = spark.range(1000)
df.cache()
df.count()

df1 = df.filter(df.id < 10)
df2 = df.filter(udf_scala(df.id))
df3 = df.filter(udf_python(df.id))

t1 = time.time()
df1.count()
t2 = time.time()
df2.count()
t3 = time.time()
df3.count()
t4 = time.time()

print "time for builtin " + str(t2 - t1)
print "time for scala " + str(t3 - t2)
print "time for python " + str(t4 - t3)






The differences between the times give you a rough measure of each approach's overhead (the caching, plus the initial count, makes sure the range is materialized up front so data generation doesn't skew the first measurement).
BTW, I saw this can be very touchy in terms of the cluster and its 
configuration. I ran it on two different cluster configurations and ran it 
several times to get some idea on the noise.
Of course, the more complicated the UDF, the less the overhead affects you.
Hope this helps.
Assaf









From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Sunday, September 04, 2016 11:00 AM
To: Mendelson, Assaf
Cc: user
Subject: Re: Scala Vs Python

Hi

This one is quite interesting. Is it possible to share few toy examples?

On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson 
> wrote:
I am not aware of any official testing but you can easily create your own.
In testing I made I saw that python UDF were more than 10 times slower than 
scala UDF (and in some cases it was closer to 50 times slower).
That said, it would depend on how you use your UDF.
For example, lets say you have a 1 billion row table which you do some 
aggregation on and left with a 10K rows table. If you do the python UDF in the 
beginning then it might have a hard hit but if you do it on the 10K rows table 
then the overhead might be negligible.
Furthermore, you can always write the UDF in scala and wrap it.
This is something my team did. We have data scientists working on spark in 
python. Normally, they can use the existing functions to do what they need 
(Spark already has a pretty nice spread of functions which answer most of the 
common use cases). When they need a new UDF or UDAF they simply ask my team 
(which does the engineering) and we write them a scala one and then wrap it to 
be accessible from python.


From: ayan guha [mailto:[hidden 
email]]
Sent: Friday, September 02, 2016 12:21 AM
To: kant kodali
Cc: Mendelson, Assaf; user
Subject: Re: Scala Vs Python

Thanks All for your replies.

Feature Parity:

MLLib, RDD and dataframes features are totally comparable. Streaming is now at 
par in functionality too, I believe. However, what really worries me is not 
having Dataset APIs at all in Python. I think thats a deal breaker.

Performance:
I do  get this bit when RDDs are involved, but not when Data frame is the only 
construct I am operating on.  Dataframe supposed to be language-agnostic in 
terms of performance.  So why people think python is slower? is it because of 
using UDF? Any other reason?

Is there any kind of benchmarking/stats around Python UDF vs Scala UDF 
comparison? like the one out there  b/w RDDs.

@Kant:  I am not comparing ANY applications. I am comparing SPARK applications 
only. I would be glad to hear your opinion on why pyspark applications will not 
work, if you have any benchmarks please share if possible.





On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <[hidden 
email]> wrote:
c'mon man this is no Brainer..Dynamic Typed Languages for Large Code Bases or 
Large Scale Distributed Systems makes absolutely no sense. I can write a 10 
page essay on why that wouldn't work so great. you might be wondering why would 
spark have it then? well probably because its ease of use for ML (that would be 
my best guess).





On Wed, Aug 31, 2016 11:45 PM, AssafMendelson [hidden 
email] wrote:

I believe this would greatly depend on your use case and your familiarity with 
the languages.


Re: Scala Vs Python

2016-09-04 Thread Simon Edelhaus
Any thoughts about Spark and Erlang?


-- ttfn
Simon Edelhaus
California 2016

On Sun, Sep 4, 2016 at 1:00 AM, ayan guha  wrote:

> Hi
>
> This one is quite interesting. Is it possible to share few toy examples?
>
> On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson 
> wrote:
>
>> I am not aware of any official testing but you can easily create your own.
>>
>> In testing I made I saw that python UDF were more than 10 times slower
>> than scala UDF (and in some cases it was closer to 50 times slower).
>>
>> That said, it would depend on how you use your UDF.
>>
>> For example, lets say you have a 1 billion row table which you do some
>> aggregation on and left with a 10K rows table. If you do the python UDF in
>> the beginning then it might have a hard hit but if you do it on the 10K
>> rows table then the overhead might be negligible.
>>
>> Furthermore, you can always write the UDF in scala and wrap it.
>>
>> This is something my team did. We have data scientists working on spark
>> in python. Normally, they can use the existing functions to do what they
>> need (Spark already has a pretty nice spread of functions which answer most
>> of the common use cases). When they need a new UDF or UDAF they simply ask
>> my team (which does the engineering) and we write them a scala one and then
>> wrap it to be accessible from python.
>>
>>
>>
>>
>>
>> *From:* ayan guha [mailto:[hidden email]
>> ]
>> *Sent:* Friday, September 02, 2016 12:21 AM
>> *To:* kant kodali
>> *Cc:* Mendelson, Assaf; user
>> *Subject:* Re: Scala Vs Python
>>
>>
>>
>> Thanks All for your replies.
>>
>>
>>
>> Feature Parity:
>>
>>
>>
>> MLLib, RDD and dataframes features are totally comparable. Streaming is
>> now at par in functionality too, I believe. However, what really worries me
>> is not having Dataset APIs at all in Python. I think thats a deal breaker.
>>
>>
>>
>> Performance:
>>
>> I do  get this bit when RDDs are involved, but not when Data frame is the
>> only construct I am operating on.  Dataframe supposed to be
>> language-agnostic in terms of performance.  So why people think python is
>> slower? is it because of using UDF? Any other reason?
>>
>>
>>
>> *Is there any kind of benchmarking/stats around Python UDF vs Scala UDF
>> comparison? like the one out there  b/w RDDs.*
>>
>>
>>
>> @Kant:  I am not comparing ANY applications. I am comparing SPARK
>> applications only. I would be glad to hear your opinion on why pyspark
>> applications will not work, if you have any benchmarks please share if
>> possible.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <[hidden email]
>> > wrote:
>>
>> c'mon man this is no Brainer..Dynamic Typed Languages for Large Code
>> Bases or Large Scale Distributed Systems makes absolutely no sense. I can
>> write a 10 page essay on why that wouldn't work so great. you might be
>> wondering why would spark have it then? well probably because its ease of
>> use for ML (that would be my best guess).
>>
>>
>>
>>
>>
>> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson [hidden email]
>>  wrote:
>>
>> I believe this would greatly depend on your use case and your familiarity
>> with the languages.
>>
>>
>>
>> In general, scala would have a much better performance than python and
>> not all interfaces are available in python.
>>
>> That said, if you are planning to use dataframes without any UDF then the
>> performance hit is practically nonexistent.
>>
>> Even if you need UDF, it is possible to write those in scala and wrap
>> them for python and still get away without the performance hit.
>>
>> Python does not have interfaces for UDAFs.
>>
>>
>>
>> I believe that if you have large structured data and do not generally
>> need UDF/UDAF you can certainly work in python without losing too much.
>>
>>
>>
>>
>>
>> *From:* ayan guha [mailto:[hidden email]
>> ]
>> *Sent:* Thursday, September 01, 2016 5:03 AM
>> *To:* user
>> *Subject:* Scala Vs Python
>>
>>
>>
>> Hi Users
>>
>>
>>
>> Thought to ask (again and again) the question: While I am building any
>> production application, should I use Scala or Python?
>>
>>
>>
>> I have read many if not most articles but all seems pre-Spark 2. Anything
>> changed with Spark 2? Either pro-scala way or pro-python way?
>>
>>
>>
>> I am thinking performance, feature parity and future direction, not so
>> much in terms of skillset or ease of use.
>>
>>
>>
>> Or, if you think it is a moot point, please say so as well.
>>
>>
>>
>> Any real life example, production experience, anecdotes, personal taste,
>> profanity all are welcome :)
>>
>>
>>
>> --
>>
>> Best Regards,
>> Ayan Guha
>>
>>
>> --
>>

Re: Scala Vs Python

2016-09-04 Thread ayan guha
Hi

This one is quite interesting. Is it possible to share few toy examples?

On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson 
wrote:

> I am not aware of any official testing but you can easily create your own.
>
> In testing I made I saw that python UDF were more than 10 times slower
> than scala UDF (and in some cases it was closer to 50 times slower).
>
> That said, it would depend on how you use your UDF.
>
> For example, lets say you have a 1 billion row table which you do some
> aggregation on and left with a 10K rows table. If you do the python UDF in
> the beginning then it might have a hard hit but if you do it on the 10K
> rows table then the overhead might be negligible.
>
> Furthermore, you can always write the UDF in scala and wrap it.
>
> This is something my team did. We have data scientists working on spark in
> python. Normally, they can use the existing functions to do what they need
> (Spark already has a pretty nice spread of functions which answer most of
> the common use cases). When they need a new UDF or UDAF they simply ask my
> team (which does the engineering) and we write them a scala one and then
> wrap it to be accessible from python.
>
>
>
>
>
> *From:* ayan guha [mailto:[hidden email]
> ]
> *Sent:* Friday, September 02, 2016 12:21 AM
> *To:* kant kodali
> *Cc:* Mendelson, Assaf; user
> *Subject:* Re: Scala Vs Python
>
>
>
> Thanks All for your replies.
>
>
>
> Feature Parity:
>
>
>
> MLLib, RDD and dataframes features are totally comparable. Streaming is
> now at par in functionality too, I believe. However, what really worries me
> is not having Dataset APIs at all in Python. I think thats a deal breaker.
>
>
>
> Performance:
>
> I do  get this bit when RDDs are involved, but not when Data frame is the
> only construct I am operating on.  Dataframe supposed to be
> language-agnostic in terms of performance.  So why people think python is
> slower? is it because of using UDF? Any other reason?
>
>
>
> *Is there any kind of benchmarking/stats around Python UDF vs Scala UDF
> comparison? like the one out there  b/w RDDs.*
>
>
>
> @Kant:  I am not comparing ANY applications. I am comparing SPARK
> applications only. I would be glad to hear your opinion on why pyspark
> applications will not work, if you have any benchmarks please share if
> possible.
>
>
>
>
>
>
>
>
>
>
>
> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <[hidden email]
> > wrote:
>
> c'mon man this is no Brainer..Dynamic Typed Languages for Large Code Bases
> or Large Scale Distributed Systems makes absolutely no sense. I can write a
> 10 page essay on why that wouldn't work so great. you might be wondering
> why would spark have it then? well probably because its ease of use for ML
> (that would be my best guess).
>
>
>
>
>
> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson [hidden email]
>  wrote:
>
> I believe this would greatly depend on your use case and your familiarity
> with the languages.
>
>
>
> In general, scala would have a much better performance than python and not
> all interfaces are available in python.
>
> That said, if you are planning to use dataframes without any UDF then the
> performance hit is practically nonexistent.
>
> Even if you need UDF, it is possible to write those in scala and wrap them
> for python and still get away without the performance hit.
>
> Python does not have interfaces for UDAFs.
>
>
>
> I believe that if you have large structured data and do not generally need
> UDF/UDAF you can certainly work in python without losing too much.
>
>
>
>
>
> *From:* ayan guha [mailto:[hidden email]
> ]
> *Sent:* Thursday, September 01, 2016 5:03 AM
> *To:* user
> *Subject:* Scala Vs Python
>
>
>
> Hi Users
>
>
>
> Thought to ask (again and again) the question: While I am building any
> production application, should I use Scala or Python?
>
>
>
> I have read many if not most articles but all seems pre-Spark 2. Anything
> changed with Spark 2? Either pro-scala way or pro-python way?
>
>
>
> I am thinking performance, feature parity and future direction, not so
> much in terms of skillset or ease of use.
>
>
>
> Or, if you think it is a moot point, please say so as well.
>
>
>
> Any real life example, production experience, anecdotes, personal taste,
> profanity all are welcome :)
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>
>
>
>
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>

RE: Scala Vs Python

2016-09-04 Thread AssafMendelson
I am not aware of any official testing but you can easily create your own.
In testing I made I saw that python UDF were more than 10 times slower than 
scala UDF (and in some cases it was closer to 50 times slower).
That said, it would depend on how you use your UDF.
For example, lets say you have a 1 billion row table which you do some 
aggregation on and left with a 10K rows table. If you do the python UDF in the 
beginning then it might have a hard hit but if you do it on the 10K rows table 
then the overhead might be negligible.
Furthermore, you can always write the UDF in scala and wrap it.
This is something my team did. We have data scientists working on spark in 
python. Normally, they can use the existing functions to do what they need 
(Spark already has a pretty nice spread of functions which answer most of the 
common use cases). When they need a new UDF or UDAF they simply ask my team 
(which does the engineering) and we write them a scala one and then wrap it to 
be accessible from python.


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Friday, September 02, 2016 12:21 AM
To: kant kodali
Cc: Mendelson, Assaf; user
Subject: Re: Scala Vs Python

Thanks All for your replies.

Feature Parity:

MLLib, RDD and dataframes features are totally comparable. Streaming is now at 
par in functionality too, I believe. However, what really worries me is not 
having Dataset APIs at all in Python. I think thats a deal breaker.

Performance:
I do  get this bit when RDDs are involved, but not when Data frame is the only 
construct I am operating on.  Dataframe supposed to be language-agnostic in 
terms of performance.  So why people think python is slower? is it because of 
using UDF? Any other reason?

Is there any kind of benchmarking/stats around Python UDF vs Scala UDF 
comparison? like the one out there  b/w RDDs.

@Kant:  I am not comparing ANY applications. I am comparing SPARK applications 
only. I would be glad to hear your opinion on why pyspark applications will not 
work, if you have any benchmarks please share if possible.





On Fri, Sep 2, 2016 at 12:57 AM, kant kodali 
> wrote:
c'mon man this is no Brainer..Dynamic Typed Languages for Large Code Bases or 
Large Scale Distributed Systems makes absolutely no sense. I can write a 10 
page essay on why that wouldn't work so great. you might be wondering why would 
spark have it then? well probably because its ease of use for ML (that would be 
my best guess).





On Wed, Aug 31, 2016 11:45 PM, AssafMendelson 
assaf.mendel...@rsa.com wrote:

I believe this would greatly depend on your use case and your familiarity with 
the languages.



In general, scala would have a much better performance than python and not all 
interfaces are available in python.

That said, if you are planning to use dataframes without any UDF then the 
performance hit is practically nonexistent.

Even if you need UDF, it is possible to write those in scala and wrap them for 
python and still get away without the performance hit.

Python does not have interfaces for UDAFs.



I believe that if you have large structured data and do not generally need 
UDF/UDAF you can certainly work in python without losing too much.





From: ayan guha [mailto:[hidden 
email]]
Sent: Thursday, September 01, 2016 5:03 AM
To: user
Subject: Scala Vs Python



Hi Users



Thought to ask (again and again) the question: While I am building any 
production application, should I use Scala or Python?



I have read many if not most articles but all seems pre-Spark 2. Anything 
changed with Spark 2? Either pro-scala way or pro-python way?



I am thinking performance, feature parity and future direction, not so much in 
terms of skillset or ease of use.



Or, if you think it is a moot point, please say so as well.



Any real life example, production experience, anecdotes, personal taste, 
profanity all are welcome :)



--

Best Regards,
Ayan Guha





--
Best Regards,
Ayan Guha





Re: Creating RDD using swebhdfs with truststore

2016-09-04 Thread Denis Bolshakov
Hello,

I would also set java opts for driver.
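
For the driver you can also set them programmatically before the first SSL connection, as a quick check (a sketch; the path and URI are assumed, and the executors would still need spark.executor.extraJavaOptions or --files to ship the truststore):

// driver JVM only: the javax.net.ssl properties are read when the default
// SSLContext is first initialised, so set them before touching swebhdfs
System.setProperty("javax.net.ssl.trustStore", "/path/to/truststore.jks")          // assumed path
System.setProperty("javax.net.ssl.trustStorePassword", sys.env.getOrElse("TRUSTSTORE_PASSWORD", ""))

val count = sc.textFile("swebhdfs://host:port/gateway/default/webhdfs/v1/somefile").count()  // placeholder URI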

Best regards,
Denis

On 4 Sep 2016 at 00:31, "Sourav Mazumder" <
sourav.mazumde...@gmail.com> wrote:

> Hi,
>
> I am trying to create a RDD by using swebhdfs to a remote hadoop cluster
> which is protected by Knox and uses SSL.
>
> The code looks like this -
>
> sc.textFile("swebhdfs:/host:port/gateway/default/webhdfs/v1/").count
>
> I'm passing the truststore and trustorepassword through extra java options
> while starting the spark shell as -
>
> spark-shell --conf 
> "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=truststor.jks
> -Djavax.net.ssl.trustStorePassword=" --conf "spark.driver.
> extraJavaOptions=-Djavax.net.ssl.trustStore=truststore.jks
> -Djavax.net.ssl.trustStorePassword="
>
> But I'm always getting the error that -
>
> Name: javax.net.ssl.SSLHandshakeException
> Message: Remote host closed connection during handshake
>
> Am I passing the truststore and truststore password in right way ?
>
> Regards,
>
> Sourav
>
>