Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
Is that all you have in the executor logs? I suspect some of those jobs are
having a hard time managing memory.

Thanks
Best Regards

On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman  wrote:

> [adding the dev list since it's probably a bug, but I'm not sure how to
> reproduce it so that I can open a bug about it]
>
> Hi,
>
> I have a standalone Spark 1.4.0 cluster with 100s of applications running
> every day.
>
> From time to time, the applications crash with the following error (see
> below)
> But at the same time (and also after that), other applications are
> running, so I can safely assume the master and workers are working.
>
> 1. Why is there a NullPointerException? (I can't trace the Scala stack
> trace to the code, but in any case an NPE usually indicates an obvious bug,
> even if there's actually a network error...)
> 2. Why can't it connect to the master? (If it's a network timeout, how do I
> increase it? I see the values are hardcoded inside AppClient.)
> 3. How do I recover from this error?
>
>
>   ERROR 01-11 15:32:54,991 SparkDeploySchedulerBackend - Application
> has been killed. Reason: All masters are unresponsive! Giving up.
>   ERROR 01-11 15:32:55,087 OneForOneStrategy - logs/error.log
>   java.lang.NullPointerException
>   at
> org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
>   at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>   at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>   at
> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at
> org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>   ERROR 01-11 15:32:55,603 SparkContext - Error
> initializing SparkContext.
>   java.lang.IllegalStateException: Cannot call methods on a stopped
> SparkContext
>   at org.apache.spark.SparkContext.org
> $apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>   at
> org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
>   at
> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
>   at
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
>
>
> Thanks!
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
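
One possible mitigation for question 3 - a minimal, hypothetical sketch
(createContextWithRetry is not a Spark API) - is to retry context creation
with a backoff, since the failure is sometimes transient:

import org.apache.spark.{SparkConf, SparkContext}

// Retry SparkContext creation a few times before giving up for good.
def createContextWithRetry(conf: SparkConf, attempts: Int = 3): SparkContext = {
  var lastError: Throwable = null
  for (i <- 1 to attempts) {
    try {
      return new SparkContext(conf)
    } catch {
      case e: Throwable =>
        lastError = e
        Thread.sleep(5000L * i)  // back off before the next attempt
    }
  }
  throw lastError
}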


Re: Guidance to get started

2015-11-09 Thread Akhil Das
You can read the installation details here:
http://spark.apache.org/docs/latest/

You can read about contributing to Spark here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Thanks
Best Regards

On Thu, Oct 29, 2015 at 3:53 PM, Aaska Shah  wrote:

> Hello, my name is Aaska Shah and I am a second-year undergrad student at
> DAIICT, Gandhinagar, India.
>
> I have lately been quite interested in contributing to open
> source organizations, and I find your organization the most appropriate one.
>
> I request you to please guide me through how to install your codebase and
> how to get started with your organization.
>
> Thanking You,
> Aaska Shah
>


Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-09 Thread Ricky
Now I tried the spark-1.5.2-rc2.zip from GitHub; the result also has errors.


[root@ouyangshourui spark-1.5.2-rc2]# pwd
/SparkCode/spark-1.5.2-rc2
[root@ouyangshourui spark-1.5.2-rc2]# nohup   mvn -Pyarn -Phadoop-2.4 
-Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package &



The error log is as follows:


[INFO] Building jar: 
/SparkCode/spark-1.5.2-rc2/external/flume-assembly/target/spark-streaming-flume-assembly_2.10-1.5.2-test-sources.jar
[INFO] 
[INFO] 
[INFO] Building Spark Project External MQTT 1.5.2
[INFO] 
Downloading: 
http://maven.oschina.net/content/groups/public/org/eclipse/paho/org.eclipse.paho.client.mqttv3/1.0.1/org.eclipse.paho.client.mqttv3-1.0.1.pom
[WARNING] The POM for org.eclipse.paho:org.eclipse.paho.client.mqttv3:jar:1.0.1 
is missing, no dependency information available
Downloading: 
http://maven.oschina.net/content/groups/public/org/eclipse/paho/org.eclipse.paho.client.mqttv3/1.0.1/org.eclipse.paho.client.mqttv3-1.0.1.jar
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [ 13.576 s]
[INFO] Spark Project Launcher . SUCCESS [ 19.966 s]
[INFO] Spark Project Networking ... SUCCESS [ 11.279 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  5.353 s]
[INFO] Spark Project Unsafe ... SUCCESS [ 20.199 s]
[INFO] Spark Project Core . SUCCESS [04:18 min]
[INFO] Spark Project Bagel  SUCCESS [ 27.070 s]
[INFO] Spark Project GraphX ... SUCCESS [01:09 min]
[INFO] Spark Project Streaming  SUCCESS [01:57 min]
[INFO] Spark Project Catalyst . SUCCESS [02:21 min]
[INFO] Spark Project SQL .. SUCCESS [02:50 min]
[INFO] Spark Project ML Library ... SUCCESS [03:01 min]
[INFO] Spark Project Tools  SUCCESS [ 13.731 s]
[INFO] Spark Project Hive . SUCCESS [02:06 min]
[INFO] Spark Project REPL . SUCCESS [ 42.023 s]
[INFO] Spark Project YARN . SUCCESS [ 56.501 s]
[INFO] Spark Project Hive Thrift Server ... SUCCESS [ 53.986 s]
[INFO] Spark Project Assembly . SUCCESS [01:58 min]
[INFO] Spark Project External Twitter . SUCCESS [ 18.626 s]
[INFO] Spark Project External Flume Sink .. SUCCESS [ 34.569 s]
[INFO] Spark Project External Flume ... SUCCESS [ 29.643 s]
[INFO] Spark Project External Flume Assembly .. SUCCESS [  4.430 s]
[INFO] Spark Project External MQTT  FAILURE [  5.822 s]
[INFO] Spark Project External MQTT Assembly ... SKIPPED
[INFO] Spark Project External ZeroMQ .. SKIPPED
[INFO] Spark Project External Kafka ... SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Project External Kafka Assembly .. SKIPPED
[INFO] Spark Project YARN Shuffle Service . SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 25:41 min
[INFO] Finished at: 2015-11-09T09:45:27-05:00
[INFO] Final Memory: 73M/1041M
[INFO] 
[ERROR] Failed to execute goal on project spark-streaming-mqtt_2.10: Could not 
resolve dependencies for project 
org.apache.spark:spark-streaming-mqtt_2.10:jar:1.5.2: Could not find artifact 
org.eclipse.paho:org.eclipse.paho.client.mqttv3:jar:1.0.1 in nexus-osc 
(http://maven.oschina.net/content/groups/public/) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-streaming-mqtt_2.10






Best Regards

Ricky Yang

------ Original Message ------
From: "Sean Owen"
Date: 2015-11-09 (Mon) 2:39
To: "Ricky" <494165...@qq.com>

Re: sample or takeSample or ??

2015-11-09 Thread Akhil Das
You can't create a new RDD just by selecting a few elements. rdd.take(n),
takeSample, etc. are actions, and they will trigger your entire pipeline to
be executed. You could, however, do something like this, I guess:

// take(10) is an action: it runs the lineage and returns the first
// 10 elements to the driver as a local array.
val sample_data = rdd.take(10)

// Re-distribute the local array as a new RDD.
val sample_rdd = sc.parallelize(sample_data)
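
A variation, if a random sample is preferred over the first n elements (a
sketch; takeSample is also an action and collects the sample to the driver):

// takeSample returns a local Array on the driver, which we then
// re-distribute as a new RDD.
val sampled = rdd.takeSample(withReplacement = false, num = 10, seed = 42L)
val sampleRdd = sc.parallelize(sampled)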



Thanks
Best Regards

On Thu, Oct 29, 2015 at 10:45 AM, 张志强(旺轩)  wrote:

> How do I get a NEW RDD with a number of elements that I specify?
> sample()? It has no count parameter. takeSample()? It returns a list, not an RDD.
>
>
>
> Help, please.
>


Support for views/ virtual tables in SparkSQL

2015-11-09 Thread Sudhir Menon
Team:

Do we plan to add support for views/ virtual tables in SparkSQL anytime
soon?
Trying to run the TPC-H workload and failing on queries that assume
support for views in the underlying database.

Thanks in advance

Suds


Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-09 Thread Ricky
Thanks for your help. I did as you said - the problem was a firewall issue.
After changing from http://maven.oschina.net to the Maven default repo
(http://repo1.maven.org), the spark-streaming-mqtt_2.10 module compiled successfully.



--

Best Regards

Ricky

------ Original Message ------
From: "Ricky" <494165...@qq.com>
Date: 2015-11-09 (Mon) 10:57
To: "Sean Owen"; "Ted Yu"; "Krishna Sankar"
Cc: "Denny Lee"; "Mark Hamstra"; "Reynold Xin"; "dev@spark.apache.org"
Subject: Re: [VOTE] Release Apache Spark 1.5.2 (RC2)




Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread tsh

Hi,

I'm in the same position right now: we are going to implement something 
like OLAP BI + Machine Learning explorations on the same cluster.
Well, the question is quite ambivalent: on the one hand, we have terabytes 
of versatile data and the need to build something like cubes (Hive 
and Hive on HBase are unsatisfactory). On the other, our users have grown 
accustomed to Tableau + Vertica.

So, right now I consider the following choices:
1) Platfora (not free, I don't know price right now) + Spark
2) AtScale + Tableau(not free, I don't know price right now) + Spark
3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some 
storage
4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka + 
Flume (has anybody used it in production?)

5) Spark + Tableau  (cubes?)

For myself, I decided not to dive into Mesos. Cassandra is hard to 
configure; you'll have to dedicate a special employee to supporting it.


I'll be glad to hear other ideas & propositions as we are at the 
beginning of the process too.


Sincerely yours, Tim Shenkao

On 11/09/2015 09:46 AM, fightf...@163.com wrote:

Hi,

Thanks for suggesting. Actually we are now evaluating and stress-testing 
Spark SQL on Cassandra while trying to define business models. FWIW, the 
solution mentioned here is different from a traditional OLAP cube engine, 
right? So we are hesitating over the common-sense or directional choice 
of OLAP architecture.

And we are happy to hear more use cases from this community.

Best,
Sun.


fightf...@163.com

*From:* Jörn Franke 
*Date:* 2015-11-09 14:40
*To:* fightf...@163.com 
*CC:* user ; dev

*Subject:* Re: OLAP query using spark dataframe with cassandra

Is there any distributor supporting these software components in
combination? If not, and your core business is not software, then
you may want to look for something else, because it might not make
sense to build up internal know-how in all of these areas.

In any case - it all depends highly on your data and queries. You
will have to do your own experiments.

On 09 Nov 2015, at 07:02, "fightf...@163.com" wrote:


Hi, community

We are especially interested in this feature integration
according to some slides from [1]. The
SMACK (Spark+Mesos+Akka+Cassandra+Kafka) stack
seems a good implementation of the lambda architecture in the
open-source world, especially for non-Hadoop-based cluster
environments. As we can see, the advantages obviously consist of:

1. the feasibility and scalability of the Spark dataframe API,
which can also make a perfect complement to Apache Cassandra's
native CQL features.

2. both streaming and batch processing availability using the
ALL-STACK thing, cool.

3. we can achieve both capacity and usability for Spark with
Cassandra, including seamless integration with job scheduling
and resource management.

Our only concern is the OLAP query performance issue, which is
mainly caused by frequent aggregation work over daily-growing
large tables, for both Spark SQL and Cassandra. I can see that
the [1] use case facilitates FiloDB to achieve columnar storage
and query performance, but we have no further knowledge.

The question is: has anyone had such a use case, especially in a
production environment? We would be interested in your
architecture for designing this OLAP engine using Spark +
Cassandra. How do you think it compares with a traditional OLAP
cube design, like Apache Kylin or Pentaho Mondrian?

Best Regards,

Sun.


[1]

http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark


fightf...@163.com 
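
For concreteness, the kind of dataframe-on-Cassandra aggregation under
discussion looks roughly like this - a minimal sketch assuming the DataStax
spark-cassandra-connector is on the classpath; the keyspace, table, and
column names here are made up:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

val sqlContext = new SQLContext(sc)

// Expose a Cassandra table as a DataFrame through the connector's
// data source.
val events = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "analytics", "table" -> "events"))
  .load()

// A typical OLAP-style rollup: aggregate a daily fact table by dimension.
events.groupBy("event_date", "country")
  .agg(sum("revenue").as("total_revenue"))
  .show()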






Re: Sort Merge Join from the filesystem

2015-11-09 Thread Alex Nastetsky
Thanks for creating that ticket.

Another thing I was thinking of, is doing this type of join between dataset
A which is already partitioned/sorted on disk and dataset B, which gets
generated during the run of the application.

Dataset B would need something like repartitionAndSortWithinPartitions to
be performed on it, using the same partitioner that was used with dataset
A. Then dataset B could be joined with dataset A without needing to write
it to disk first (unless it's too big to fit in memory, then it would need
to be [partially] spilled).
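
A rough sketch of that preparation step for dataset B, assuming plain RDDs
(rddB and the partition count are made up; the partitioner must match the
one used when dataset A was written out):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def prepareForMergeJoin(rddB: RDD[(String, String)]): RDD[(String, String)] = {
  // Same partitioner and partition count as dataset A on disk, so that
  // partition i of B lines up with partition i of A.
  val partitioner = new HashPartitioner(64)
  rddB.repartitionAndSortWithinPartitions(partitioner)
}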

On Wed, Nov 4, 2015 at 7:51 PM, Cheng, Hao  wrote:

> Yes, we probably need more changes to the data source API if we need to
> implement it in a generic way.
>
> BTW, I created the JIRA by copying most of the words from Alex. :)
>
>
>
> https://issues.apache.org/jira/browse/SPARK-11512
>
>
>
>
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Thursday, November 5, 2015 1:36 AM
> *To:* Alex Nastetsky
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Sort Merge Join from the filesystem
>
>
>
> It's not supported yet, and not sure if there is a ticket for it. I don't
> think there is anything fundamentally hard here either.
>
>
>
>
>
> On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <
> alex.nastet...@vervemobile.com> wrote:
>
> (this is kind of a cross-post from the user list)
>
>
>
> Does Spark support doing a sort merge join on two datasets on the file
> system that have already been partitioned the same with the same number of
> partitions and sorted within each partition, without needing to
> repartition/sort them again?
>
>
>
> This functionality exists in
>
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
>
> - Pig (USING 'merge')
>
> - MapReduce (CompositeInputFormat)
>
>
>
> If this is not supported in Spark, is a ticket already open for it? Does
> the Spark architecture present unique difficulties to having this feature?
>
>
>
> It is very useful to have this ability, as you can prepare dataset A to be
> joined with dataset B before B even exists, by pre-processing A with a
> partition/sort.
>
>
>
> Thanks.
>
>
>


Re: Block Transfer Service encryption support

2015-11-09 Thread turp1twin
I created a pull request for issue SPARK-6373. Any feedback would
be appreciated... https://github.com/apache/spark/pull/9416


Jeff




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Block-Transfer-Service-encryption-support-tp10934p15098.html



Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Romi Kuntsman
If they have a problem managing memory, wouldn't there be an OOM?
Why does AppClient throw an NPE?

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das 
wrote:

> Is that all you have in the executor logs? I suspect some of those jobs
> are having a hard time managing  the memory.
>
> Thanks
> Best Regards
>


Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
Did you find anything regarding the OOM in the executor logs?

Thanks
Best Regards

On Mon, Nov 9, 2015 at 8:44 PM, Romi Kuntsman  wrote:

> If they have a problem managing memory, wouldn't there be an OOM?
> Why does AppClient throw an NPE?
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das 
> wrote:
>
>> Is that all you have in the executor logs? I suspect some of those jobs
>> are having a hard time managing  the memory.
>>
>> Thanks
>> Best Regards
>>


Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Romi Kuntsman
I didn't see anything about an OOM.
This sometimes happens before anything in the application has happened, and
it happens to a few applications at the same time - so I guess it's a
communication failure, but the problem is that the error shown doesn't
represent the actual problem (which may be a network timeout, etc.).

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Mon, Nov 9, 2015 at 6:00 PM, Akhil Das 
wrote:

> Did you find anything regarding the OOM in the executor logs?
>
> Thanks
> Best Regards
>


Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Tim Preece
Searching shows several people have hit this same NPE at AppClient.scala line 160
(perhaps because appId was null - could the application have been stopped before
it registered?).



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Some-spark-apps-fail-with-All-masters-are-unresponsive-while-others-pass-normally-tp14858p15096.html



Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread Luke Han
Some friends referred me to this thread about OLAP/Kylin and Spark...

Here's my 2 cents...

If you are trying to set up OLAP, Apache Kylin should be a good option for
you to evaluate.

The project has been in development for more than 2 years and is going to
graduate to an Apache Top Level Project [1].
There are already many production deployments, including eBay, Exponential,
JD.com, VIP.com and others; refer to the powered-by page [2].

Apache Kylin's Spark engine is also on the way; there's a discussion about
tuning the performance [3].

A variety of clients are available to interact with Kylin via ANSI SQL,
including Tableau, Zeppelin, Pentaho/Mondrian and Saiku/Mondrian, and
Excel/PowerBI support will roll out this week.

Apache Kylin is young but mature, with huge case validation (the biggest
cube at eBay contains 85+B rows, and the production platform's
90th-percentile query latency is a few seconds).

Streaming OLAP is coming in Kylin v2.0 with a pluggable architecture;
there's already one real production case inside eBay - please refer to our
design deck [4].

Everyone is very welcome to join and contribute to Kylin as an OLAP engine
for Big Data :-)

Please feel free to contact our community or me with any questions.

Thanks.

1. http://s.apache.org/bah
2. http://kylin.incubator.apache.org/community/poweredby.html
3. http://s.apache.org/lHA
4.
http://www.slideshare.net/lukehan/1-apache-kylin-deep-dive-streaming-and-plugin-architecture-apache-kylin-meetup-shanghai
5. http://kylin.io


Best Regards!
-

Luke Han


Re: Support for views/ virtual tables in SparkSQL

2015-11-09 Thread Zhan Zhang
I think you can rewrite those TPC-H queries without using views - for example,
with registerTempTable.
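
A minimal sketch against the 1.5-era DataFrame API, assuming a sqlContext
with the TPC-H tables registered (the table and column names are
illustrative, loosely following TPC-H Q15):

import org.apache.spark.sql.functions.{expr, sum}

// Build the "view" as a DataFrame and register it as a temp table, so
// later SQL can refer to it by name, much like a database view.
val revenue0 = sqlContext.table("lineitem")
  .filter("l_shipdate >= '1996-01-01' AND l_shipdate < '1996-04-01'")
  .groupBy("l_suppkey")
  .agg(sum(expr("l_extendedprice * (1 - l_discount)")).as("total_revenue"))

revenue0.registerTempTable("revenue0")

sqlContext.sql(
  "SELECT s_suppkey, s_name, total_revenue " +
  "FROM supplier JOIN revenue0 ON s_suppkey = l_suppkey").show()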

Thanks.

Zhan Zhang

On Nov 9, 2015, at 9:34 PM, Sudhir Menon  wrote:

> Team:
> 
> Do we plan to add support for views/ virtual tables in SparkSQL anytime soon?
> Trying to run the TPC-H workload and failing on queries that assumes support 
> for views in the underlying database
> 
> Thanks in advance
> 
> Suds





Re: ml.feature.Word2Vec.transform() very slow issue

2015-11-09 Thread Sean Owen
Since it's a fairly expensive operation to build the Map, I tend to agree
it should not happen in the loop.
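
The shape of the fix, for reference - an illustrative sketch only, not the
actual Word2VecModel internals (the names here are made up):

import org.apache.spark.broadcast.Broadcast

// Hoist the broadcast read out of the per-sentence loop: read the
// word-vector map once, then average vectors for each sentence.
def averageVectors(
    sentences: Iterator[Seq[String]],
    bVectors: Broadcast[Map[String, Array[Float]]],
    vectorSize: Int): Iterator[Array[Float]] = {
  val vectors = bVectors.value  // one broadcast read, not one per sentence
  sentences.map { sentence =>
    val sum = new Array[Float](vectorSize)
    var count = 0
    for (word <- sentence; v <- vectors.get(word)) {
      var i = 0
      while (i < vectorSize) { sum(i) += v(i); i += 1 }
      count += 1
    }
    if (count > 0) {
      var i = 0
      while (i < vectorSize) { sum(i) /= count; i += 1 }
    }
    sum
  }
}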

On Tue, Nov 10, 2015 at 5:08 AM, Yuming Wang  wrote:

> Hi
>
>
>
> I found org.apache.spark.ml.feature.Word2Vec.transform() very slow.
>
> I think we should not read the broadcast for every sentence, so I fixed it on my fork.
>
>
>
> https://github.com/979969786/spark/commit/a9f894df3671bb8df2f342de1820dab3185598f3
>
>
>
> I tested it on the same rows: the original version consumed *5 minutes*,
>
> and my version consumed just *22 seconds* on the same data.
>
>
>
> If I'm right, I will open a pull request.
>
>
>
> Thanks
>
>


Re: ml.feature.Word2Vec.transform() very slow issue

2015-11-09 Thread Nick Pentreath
Seems a straightforward change that purely enhances efficiency, so yes
please submit a JIRA and PR for this

On Tue, Nov 10, 2015 at 8:56 AM, Sean Owen  wrote:

> Since it's a fairly expensive operation to build the Map, I tend to agree
> it should not happen in the loop.
>


Re: Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread fightf...@163.com
Hi,

According to my experience, I would recommend option 3) using Apache Kylin for 
your requirements. 

This is a suggestion based on the open-source world. 

As for the Cassandra thing, I accept your advice about the special support 
requirement. But the community is very

open and convenient for prompt responses. 



fightf...@163.com
 



Re: Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread fightf...@163.com
Hi, 

Have you ever considered Cassandra as a replacement? We now have almost the 
same usage as your engine, e.g. using MySQL to store 

the initial aggregated data. Can you share more about your kind of Cube queries? 
We are very interested in that architecture too : )

Best,
Sun.


fightf...@163.com
 
From: Andrés Ivaldi
Date: 2015-11-10 07:03
To: tsh
CC: fightf...@163.com; user; dev
Subject: Re: OLAP query using spark dataframe with cassandra
Hi,
I'm also considering something similar. Plain Spark is too slow for my case; a 
possible solution is to use Spark as a multiple-source connector and basic 
transformation layer, then persist the information (currently to an RDBMS); after 
that, with our engine, we build a kind of Cube query, and the result is 
processed again by Spark, adding Machine Learning.
Our missing part is replacing the RDBMS with something more suitable and 
scalable than an RDBMS; we don't care about pre-processing the information if, 
after pre-processing, the queries are fast.

Regards

-- 
Ing. Ivaldi Andres


ml.feature.Word2Vec.transform() very slow issue

2015-11-09 Thread Yuming Wang
Hi



I found org.apache.spark.ml.feature.Word2Vec.transform() very slow.

I think we should not read the broadcast for every sentence, so I fixed it on my fork.



https://github.com/979969786/spark/commit/a9f894df3671bb8df2f342de1820dab3185598f3



I tested it on the same rows: the original version consumed *5 minutes*,

and my version consumed just *22 seconds* on the same data.




If I'm right, I will open a pull request.



Thanks


[build system] shane OOO until monday, nov 16

2015-11-09 Thread shane knapp
i'll be at the USENIX LISA conference in DC, so josh and jon will be
keeping an eye on jenkins and making sure it doesn't misbehave.

since attending every session of every day will drive one insane, i
will be sporadically checking in and making sure things are humming
along...  but for emergencies, feel free to reach out to either josh
rosen or jon kuroda (CCed on this mail).

danke shane  :)




RE: Sort Merge Join from the filesystem

2015-11-09 Thread Cheng, Hao
Yes, we definitely need to think about how to handle this case - it's probably 
even more common than the case where both tables are sorted/partitioned. Can you 
jump to the JIRA and leave a comment there?

From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
Sent: Tuesday, November 10, 2015 3:03 AM
To: Cheng, Hao
Cc: Reynold Xin; dev@spark.apache.org
Subject: Re: Sort Merge Join from the filesystem

Thanks for creating that ticket.

Another thing I was thinking of, is doing this type of join between dataset A 
which is already partitioned/sorted on disk and dataset B, which gets generated 
during the run of the application.

Dataset B would need something like repartitionAndSortWithinPartitions to be 
performed on it, using the same partitioner that was used with dataset A. Then 
dataset B could be joined with dataset A without needing to write it to disk 
first (unless it's too big to fit in memory, then it would need to be 
[partially] spilled).

On Wed, Nov 4, 2015 at 7:51 PM, Cheng, Hao wrote:
Yes, we probably need more changes to the data source API if we need to 
implement it in a generic way.
BTW, I created the JIRA by copying most of the words from Alex. ☺

https://issues.apache.org/jira/browse/SPARK-11512






Re: Anyone has perfect solution for spark source code compilation issue on intellij

2015-11-09 Thread Tim Preece
I've had success building with Maven (3.3.3) with:
Intellij 14.1.5
scala 2.10.4
openjdk 7  (1.7.0_79)

What OS/platform are you on?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-has-perfect-solution-for-spark-source-code-compilation-issue-on-intellij-tp14887p15088.html



Re: OLAP query using spark dataframe with cassandra

2015-11-09 Thread Ted Yu
Please consider using a NoSQL engine such as HBase. 

Cheers

> On Nov 9, 2015, at 3:03 PM, Andrés Ivaldi  wrote:
> 
> Hi,
> I'm also considering something similar. Plain Spark is too slow for my case; 
> a possible solution is to use Spark as a multiple-source connector and basic 
> transformation layer, then persist the information (currently to an RDBMS); 
> after that, with our engine, we build a kind of Cube query, and the result is 
> processed again by Spark, adding Machine Learning.
> Our missing part is replacing the RDBMS with something more suitable and 
> scalable than an RDBMS; we don't care about pre-processing the information if, 
> after pre-processing, the queries are fast.
> 
> Regards
> 
> 
> 
> 
> -- 
> Ing. Ivaldi Andres