ECOS Spark Integration

2017-12-17 Thread Debasish Das
Hi,

ECOS is a solver for second-order cone programs (SOCPs), and we showed the
Spark integration at the 2014 Spark Summit:
https://spark-summit.org/2014/quadratic-programing-solver-for-non-negative-matrix-factorization/.
Right now the examples show how to reformulate matrix factorization as a
SOCP and solve each alternating step with ECOS:

https://github.com/embotech/ecos-java-scala

For distributed optimization, I expect it will be useful where, for each
primary row key (sensor, car, robot :-), we are fitting a constrained
quadratic / cone program. Please try it out and let me know your feedback.
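
For illustration, here is a minimal sketch of that per-key pattern. It is only
a sketch: Observation, fitPerKey, and solveConeProgram are hypothetical names,
and the actual ecos-java-scala entry points may differ, so the solver is left
as a parameter.

import org.apache.spark.rdd.RDD

// Hypothetical input record keyed by the primary row key (sensor, car, robot, ...).
case class Observation(key: String, features: Array[Double], label: Double)

// Group the observations by key and fit one small constrained problem per key
// on the executors. `solveConeProgram` stands in for whatever per-problem
// entry point the ECOS bindings expose.
def fitPerKey(data: RDD[Observation],
              solveConeProgram: Seq[Observation] => Array[Double]): RDD[(String, Array[Double])] =
  data.groupBy(_.key)
      .mapValues(obs => solveConeProgram(obs.toSeq))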

Thanks.
Deb


Re: flatMap() returning large class

2017-12-17 Thread Don Drake
Hey Richard,

Good to hear from you as well. I thought I would ask in case there was
something Scala-specific I was missing in handling these large classes.

I can tweak my job to do a map() and then only one large object will be
created at a time and returned, which should allow me to lower my executor
memory size.
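
For reference, here is a rough sketch of the mapPartitions approach Marcelo
suggested below. InputRecord, BigDataStructure, and expand are placeholders
for the real types in my job; the point is that the Iterator stays lazy, so
only one large element needs to be materialized at a time per partition.

import org.apache.spark.sql.{Dataset, Encoder}

// Placeholder types; the real job expands one input record into many large records.
case class InputRecord(path: String)
case class BigDataStructure(bytes: Array[Byte])

// A lazy expansion of one record; nothing is materialized until Spark pulls
// the next element from the iterator.
def expand(rec: InputRecord): Iterator[BigDataStructure] = Iterator.empty

def expandLazily(ds: Dataset[InputRecord])
                (implicit enc: Encoder[BigDataStructure]): Dataset[BigDataStructure] =
  ds.mapPartitions(records => records.flatMap(expand))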

Thanks.

-Don


On Thu, Dec 14, 2017 at 2:58 PM, Richard Garris wrote:

> Hi Don,
>
> Good to hear from you. I think the problem is that, regardless of whether
> you use yield or a generator, Spark internally will produce the entire
> result as a single large JVM object, which will blow up your heap space.
>
> Would it be possible to shrink the overall size of the image object by
> storing it as a vector or Array instead of a large Java class object?
>
> That might be the more prudent approach.
>
> -RG
>
> Richard Garris
>
> Principal Architect
>
> Databricks, Inc
>
> 650.200.0840
>
> rlgar...@databricks.com
>
> On December 14, 2017 at 10:23:00 AM, Marcelo Vanzin (van...@cloudera.com)
> wrote:
>
> This sounds like something mapPartitions should be able to do, not
> sure if there's an easier way.
>
> On Thu, Dec 14, 2017 at 10:20 AM, Don Drake  wrote:
> > I'm looking for some advice when I have a flatMap on a Dataset that is
> > creating and returning a sequence of a new case class
> > (Seq[BigDataStructure]) that contains a very large amount of data, much
> > larger than the single input record (think images).
> >
> > In Python, you can use generators (yield) to bypass creating a large
> > list of structures and returning the list.
> >
> > I'm programming this in Scala and was wondering if there are any
> > similar tricks to optimally return a list of classes? I found the
> > for/yield semantics, but it appears the compiler is just creating a
> > sequence for you, and this will blow through my heap given the number
> > of elements in the list and the size of each element.
> >
> > Is there anything else I can use?
> >
> > Thanks.
> >
> > -Don
> >
> > --
> > Donald Drake
> > Drake Consulting
> > http://www.drakeconsulting.com/
> > https://twitter.com/dondrake
> > 800-733-2143
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143


Re: How to...UNION ALL of two SELECTs over different data sources in parallel?

2017-12-17 Thread Jacek Laskowski
Thanks Silvio!

In the meantime, with help from Adam and a code review of WholeStageCodegenExec
and CollapseCodegenStages, I found out that anything that's codegen'd executes
as part of the tasks of a single stage. In this case, the union of the two
codegen'd subtrees is indeed executed in parallel.
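
For anyone following along, a quick way to check this yourself from
spark-shell (Spark 2.x; just a sketch) is to look at the physical plan and
the partition count:

val q = spark.range(1).union(spark.range(2))
q.explain()              // physical plan: a Union over two codegen'd subtrees
q.rdd.getNumPartitions   // partitions of both children combined, all read in one stage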

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Sat, Dec 16, 2017 at 7:12 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> Hi Jacek,
>
>
>
> Just replied to the SO thread as well, but…
>
>
>
> Yes, your first statement is correct. The DFs in the union are read in the
> same stage, so in your example, where each DF has 8 partitions, you have a
> stage with 16 tasks to read the 2 DFs. There's no need to define the DF in
> a separate thread. You can also verify this in the Stage UI by looking at
> the Event Timeline. You should see the tasks across the DFs executing in
> parallel as expected.
>
>
>
> Here’s the UI for the following example, in which case each DF only has 1
> partition (so we get a stage with 2 tasks):
>
>
>
> spark.range(1, 100, 1, 1).write.save("/tmp/df1")
>
> spark.range(101, 200, 1, 1).write.save("/tmp/df2")
>
>
>
> spark.read.load("/tmp/df1").union(spark.read.load("/tmp/df2")).foreach {
> _ => }
>
>
>
>
>
> From: Jacek Laskowski
> Date: Saturday, December 16, 2017 at 6:40 AM
> To: "user @spark"
> Subject: How to...UNION ALL of two SELECTs over different data sources in parallel?
>
>
>
> Hi,
>
>
>
> I've been trying to find out the answer to the question about UNION ALL
> and SELECTs @ https://stackoverflow.com/q/47837955/1305344
>
>
>
> > If I have Spark SQL statement of the form SELECT [...] UNION ALL SELECT
> [...], will the two SELECT statements be executed in parallel? In my
> specific use case the two SELECTs are querying two different database
> tables. In contrast to what I would have expected, the Spark UI seems to
> suggest that the two SELECT statements are performed sequentially.
>
>
>
> How can I tell whether the two separate SELECTs are executed in parallel or
> not? What tools can show this?
>
>
>
> My answer was to use the explain operator, which shows...well...the physical
> plan, but I am not sure how to read it to know whether a query plan is going
> to be executed in parallel or not.
>
>
>
> I then used the underlying RDD lineage (using rdd.toDebugString) hoping
> that gives me the answer, but...I'm not so sure.
>
>
>
> For a query like the following:
>
>
>
> val q = spark.range(1).union(spark.range(2))
>
>
>
> I thought that since both SELECTs are codegen'ed they could be executed in
> parallel, but when I switched to the RDD lineage I lost my confidence, given
> there's just one single stage (!)
>
>
>
> scala> q.rdd.toDebugString
> res4: String =
> (16) MapPartitionsRDD[17] at rdd at <console>:26 []
>  |   MapPartitionsRDD[16] at rdd at <console>:26 []
>  |   UnionRDD[15] at rdd at <console>:26 []
>  |   MapPartitionsRDD[11] at rdd at <console>:26 []
>  |   MapPartitionsRDD[10] at rdd at <console>:26 []
>  |   ParallelCollectionRDD[9] at rdd at <console>:26 []
>  |   MapPartitionsRDD[14] at rdd at <console>:26 []
>  |   MapPartitionsRDD[13] at rdd at <console>:26 []
>  |   ParallelCollectionRDD[12] at rdd at <console>:26 []
>
>
>
> What am I missing, and how can I be certain whether (and which parts of) a
> query are going to be executed in parallel?
>
>
>
> Please help...
>
>
>
> Pozdrawiam,
>
> Jacek Laskowski
>
> 
>
> https://about.me/JacekLaskowski
>
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
>
> Follow me at https://twitter.com/jaceklaskowski
>


spark + SalesForce SSL HandShake Issue

2017-12-17 Thread kali.tumm...@gmail.com
Hi All, 

I was trying out the Spark + Salesforce library on Cloudera 5.9 and I am
having an SSL handshake issue. Please check out my question on Stack Overflow;
no one has answered it yet.

The library works fine on Windows, but it fails when I try to run it on the
Cloudera edge node.

https://stackoverflow.com/questions/47820372/ssl-handshake-issue-on-cloudera-5-9

Has anyone tried out these Spark + Salesforce libraries
(https://github.com/springml/spark-salesforce and
https://github.com/springml/salesforce-wave-api)?


I am trying to use the Salesforce Wave API library
(https://github.com/springml/salesforce-wave-api) in a Cloudera 5.9 cluster to
get data. I have to use a proxy because, from our cluster, that is the only way
to communicate with the outside world. So I made changes to the library to take
a proxy host and port; below are the places where I made changes.

Change 1:-

config.setProxy("myenterpiseproxy server",port);
https://github.com/springml/salesforce-wave-api/blob/0ac76aeb2221d9e7038229fd352a8694e8cde7e9/src/main/java/com/springml/salesforce/wave/util/SFConfig.java#L101

Change 2:-

HttpHost proxy = new HttpHost("myenterpiseproxy server", port, "http");
https://github.com/springml/salesforce-wave-api/blob/0ac76aeb2221d9e7038229fd352a8694e8cde7e9/src/main/java/com/springml/salesforce/wave/util/HTTPHelper.java#L127

Change 3:-

RequestConfig requestConfig = RequestConfig.custom()
    .setSocketTimeout(timeout)
    .setConnectTimeout(timeout)
    .setConnectionRequestTimeout(timeout)
    .setProxy(proxy)
    .build();
https://github.com/springml/salesforce-wave-api/blob/0ac76aeb2221d9e7038229fd352a8694e8cde7e9/src/main/java/com/springml/salesforce/wave/util/HTTPHelper.java#L129

I built an application that uses the Salesforce Wave API as a dependency, and
when I try to execute the jar I get an SSL handshake error.

I passed in the javax.net.ssl.truststore, javax.net.ssl.keyStore, and
https.protocols settings for my cluster and am still having problems.
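
For reference, these are the standard JSSE system property names (note they
are case-sensitive). A minimal sketch of setting them programmatically before
the first HTTPS call, using the paths from the run book below; whether the
HTTP client honors https.protocols depends on how its socket factory is built,
so treat this as an assumption, not a verified fix:

// Sketch only: standard JSSE system properties, set before any HTTPS connection is made.
System.setProperty("https.protocols", "TLSv1.1,TLSv1.2")
System.setProperty("javax.net.ssl.trustStore",
  "/usr/java/jdk1.7.0_67-cloudera/jre/lib/security/jssecacerts")
System.setProperty("javax.net.ssl.keyStore",
  "/opt/cloudera/security/jks/uscvlpcldra-keystore.jks")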

Has anyone had a similar issue? Has anyone tried to use this library on a
Cloudera cluster?

Run Book:-

java -cp
httpclient-4.5.jar:SFWaveApiTest-1.0-SNAPSHOT-jar-with-dependencies.jar
com.az.sfget.SFGetTest "username" "passwordwithtoken"
"https://test.salesforce.com/services/Soap/u/35"; "select id,OWNERID from
someobject" "enterpiseproxyhost" "9400" "TLSv1.1,TLSv1.2"
"/usr/java/jdk1.7.0_67-cloudera/jre/lib/security/jssecacerts"
"/opt/cloudera/security/jks/uscvlpcldra-keystore.jks"
Error:-

javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:946)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1312)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1339)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1323)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:394)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:353)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:134)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:388)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
    at com.springml.salesforce.wave.util.HTTPHelper.execute(HTTPHelper.java:122)
    at com.springml.salesforce.wave.util.HTTPHelper.get(HTTPHelper.java:88)
    at com.springml.salesforce.wave.util.HTTPHelper.get(HTTPHelper.java:92)
    at com.springml.salesforce.wave.impl.ForceAPIImpl.query(ForceAPIImpl.java:120)
    at com.springml.salesforce.wave.impl.ForceAPIImpl.query(ForceAPIImpl.java:36)
    at com.az.sfget.SFGetTest.main(SFGetTest.java:54)
Caused by: java.io.EOFException: SSL peer shut down incorrectly
    at sun.security.ssl.InputRecord.read(InputRecord.java:482)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
    ... 21 more


Thanks
Sri 



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Lucene Index with Spark Cassandra

2017-12-17 Thread Junaid Nasir
Hi everyone,
I am trying to run Lucene with Spark, but Spark SQL returns zero results,
whereas the same query run through cqlsh returns the correct rows. It is the
same issue as https://github.com/Stratio/cassandra-lucene-index/issues/79. I
can see in the Spark logs that Lucene is working, but as mentioned in the link,
Spark is filtering those results out afterwards. Any help to prevent this would
be highly appreciated.
Using datastax:spark-cassandra-connector:2.0.3-s_2.11
Spark: 2.1.1
Cassandra: 3.11.0
Lucene: 3.11.0.0

I am using the old syntax as mentioned in the Lucene docs, i.e. I created a
dummy column.
C* schema and Lucene index:

CREATE TABLE alldev.temp (
    devid text,
    day date,
    datetime timestamp,
    lucene text,
    value text,
    PRIMARY KEY ((devid, day), datetime)
);

CREATE CUSTOM INDEX idx ON alldev.temp (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds': '1',
    'schema': '{
        fields: {
            devid: {type: "string"},
            day: {type: "date", pattern: "-MM-dd"},
            datetime: {type: "date"},
            value: {type: "integer"}
        }
    }'
};

cqlsh> select * from alldev.temp where lucene = '{filter: {type: "range",
field: "value", lower: "0"}}';




Thanks for your time,
Junaid