Re: Can a Spark Driver Program be a REST Service by itself?

2015-07-01 Thread Arush Kharbanda
You can try using Spark Jobserver

https://github.com/spark-jobserver/spark-jobserver

On Wed, Jul 1, 2015 at 4:32 PM, Spark Enthusiast 
wrote:

> Folks,
>
> My Use case is as follows:
>
> My Driver program will be aggregating a bunch of Event Streams and acting
> on it. The Action on the aggregated events is configurable and can change
> dynamically.
>
> One way I can think of is to run the Spark Driver as a Service where a
> config push can be caught via an API that the Driver exports.
> Can I have a Spark Driver Program run as a REST Service by itself? Is this
> a common use case?
> Is there a better way to solve my problem?
>
> Thanks
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: How many executors can I acquire in standalone mode ?

2015-05-26 Thread Arush Kharbanda
I believe you would be restricted by the number of cores available in your
cluster. Having a worker running without any cores assigned to it is useless.
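
For reference, a minimal sketch (the master URL, app name, and numbers are placeholders) of capping the cores an application takes in standalone mode, since by default the app greedily takes every core it is offered:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("CoreCapExample")            // hypothetical app name
    .setMaster("spark://master-host:7077")   // placeholder standalone master URL
    .set("spark.cores.max", "8")             // cap the total cores granted to this app
    .set("spark.executor.memory", "2g")
  val sc = new SparkContext(conf)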

On Tue, May 26, 2015 at 3:04 PM, canan chen  wrote:

> In spark standalone mode, there will be one executor per worker. I am
> wondering how many executor can I acquire when I submit app ? Is it greedy
> mode (as many as I can acquire )?
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Arush Kharbanda
Hi Evo,

The worker is the JVM and an executor runs on that JVM. After Spark 1.4 you
should be able to run multiple executors on the same JVM/worker:

https://issues.apache.org/jira/browse/SPARK-1706

Thanks
Arush

On Tue, May 26, 2015 at 2:54 PM, canan chen  wrote:

> I think the concept of task in spark should be on the same level of task
> in MR. Usually in MR, we need to specify the memory the each mapper/reducer
> task. And I believe executor is not a user-facing concept, it's a spark
> internal concept. For spark users they don't need to know the concept of
> executor, but need to know the concept of task.
>
> On Tue, May 26, 2015 at 5:09 PM, Evo Eftimov 
> wrote:
>
>> This is the first time I hear that “one can specify the RAM per task” –
>> the RAM is granted per Executor (JVM). On the other hand each Task operates
>> on ONE RDD Partition – so you can say that this is “the RAM allocated to
>> the Task to process” – but it is still within the boundaries allocated to
>> the Executor (JVM) within which the Task is running. Also while running,
>> any Task like any JVM Thread can request as much additional RAM e.g. for
>> new Object instances  as there is available in the Executor aka JVM Heap
>>
>>
>>
>> *From:* canan chen [mailto:ccn...@gmail.com]
>> *Sent:* Tuesday, May 26, 2015 9:30 AM
>> *To:* Evo Eftimov
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: How does spark manage the memory of executor with
>> multiple tasks
>>
>>
>>
>> Yes, I know that one task represent a JVM thread. This is what I
>> confused. Usually users want to specify the memory on task level, so how
>> can I do it if task if thread level and multiple tasks runs in the same
>> executor. And even I don't know how many threads there will be. Besides
>> that, if one task cause OOM, it would cause other tasks in the same
>> executor fail too. There's no isolation between tasks.
>>
>>
>>
>> On Tue, May 26, 2015 at 4:15 PM, Evo Eftimov 
>> wrote:
>>
>> An Executor is a JVM instance spawned and running on a Cluster Node
>> (Server machine). Task is essentially a JVM Thread – you can have as many
>> Threads as you want per JVM. You will also hear about “Executor Slots” –
>> these are essentially the CPU Cores available on the machine and granted
>> for use to the Executor
>>
>>
>>
>> Ps: what creates ongoing confusion here is that the Spark folks have
>> “invented” their own terms to describe the design of their what is
>> essentially a Distributed OO Framework facilitating Parallel Programming
>> and Data Management in a Distributed Environment, BUT have not provided
>> clear dictionary/explanations linking these “inventions” with standard
>> concepts familiar to every Java, Scala etc developer
>>
>>
>>
>> *From:* canan chen [mailto:ccn...@gmail.com]
>> *Sent:* Tuesday, May 26, 2015 9:02 AM
>> *To:* user@spark.apache.org
>> *Subject:* How does spark manage the memory of executor with multiple
>> tasks
>>
>>
>>
>> Since spark can run multiple tasks in one executor, so I am curious to
>> know how does spark manage memory across these tasks. Say if one executor
>> takes 1GB memory, then if this executor can run 10 tasks simultaneously,
>> then each task can consume 100MB on average. Do I understand it correctly ?
>> It doesn't make sense to me that spark run multiple tasks in one executor.
>>
>>
>>
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Websphere MQ as a data source for Apache Spark Streaming

2015-05-25 Thread Arush Kharbanda
Hi Umesh,

You can connect WebSphere MQ to Spark Streaming via MQTT; refer to this example:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala
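
As a rough sketch of what that example boils down to (this needs the spark-streaming-mqtt artifact on the classpath; the broker URL and topic are placeholders, and WebSphere MQ needs its MQTT/telemetry channel enabled for this route to apply):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.mqtt.MQTTUtils

  val sparkConf = new SparkConf().setAppName("MQTTWordCount")
  val ssc = new StreamingContext(sparkConf, Seconds(2))

  // placeholder broker URL and topic: point these at your MQ telemetry endpoint
  val lines = MQTTUtils.createStream(ssc, "tcp://mq-host:1883", "some/topic")

  val wordCounts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
  wordCounts.print()

  ssc.start()
  ssc.awaitTermination()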



Thanks
Arush



On Mon, May 25, 2015 at 3:43 PM, umesh9794 
wrote:

> I was digging into the possibilities for Websphere MQ as a data source for
> spark-streaming becuase it is needed in one of our use case. I got to know
> that  MQTT <http://mqtt.org/>   is the protocol that supports the
> communication from MQ data structures but since I am a newbie to spark
> streaming I need some working examples for the same. Did anyone try to
> connect the MQ with spark streaming. Please devise the best way for doing
> so.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Websphere-MQ-as-a-data-source-for-Apache-Spark-Streaming-tp23013.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: problem with spark thrift server

2015-04-23 Thread Arush Kharbanda
Hi

What do you mean disable the driver? what are you trying to achieve.

Thanks
Arush

On Thu, Apr 23, 2015 at 12:29 PM, guoqing0...@yahoo.com.hk <
guoqing0...@yahoo.com.hk> wrote:

> Hi ,
> I have a question about spark thrift server , i deployed the spark on yarn
>  and found if the spark driver disable , the spark application will be
> crashed on yarn.  appreciate for any suggestions and idea .
>
> Thank you!
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Spark SQL performance issue.

2015-04-23 Thread Arush Kharbanda
Hi

Can you share your Web UI, depicting your task-level breakup? I can see many
things that can be improved:

1. JavaRDD rdds = ...; rdds.cache(); -> this caching is not needed, as you are
not reading the RDD in any other action.

2. Instead of collecting the result as a list, save it as a text file if you
can. That would avoid moving the results to the driver (a rough sketch follows
below).
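
A rough sketch of both points applied to your snippet (the salary bounds and output path are placeholders; sqlContext is the one from your code):

  // caching the registered table is enough; caching the source RDD as well is unnecessary here
  sqlContext.cacheTable("person")

  // writing the result out keeps the work distributed instead of pulling every row to the driver
  val result = sqlContext.sql(
    "SELECT id, name, salary FROM person WHERE salary >= 1000 AND salary <= 5000")
  result.rdd
    .map(row => s"${row(0)},${row(1)},${row(2)}")
    .saveAsTextFile("hdfs:///tmp/person_query_output")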

Thanks
Arush

On Thu, Apr 23, 2015 at 2:47 PM, Nikolay Tikhonov  wrote:

> > why are you cache both rdd and table?
> I try to cache all the data to avoid the bad performance for the first
> query. Is it right?
>
> > Which stage of job is slow?
> The query is run many times on one sqlContext and each query execution
> takes 1 second.
>
> 2015-04-23 11:33 GMT+03:00 ayan guha :
>
>> Quick questions: why are you cache both rdd and table?
>> Which stage of job is slow?
>> On 23 Apr 2015 17:12, "Nikolay Tikhonov" 
>> wrote:
>>
>>> Hi,
>>> I have Spark SQL performance issue. My code contains a simple JavaBean:
>>>
>>> public class Person implements Externalizable {
>>> private int id;
>>> private String name;
>>> private double salary;
>>> 
>>> }
>>>
>>>
>>> Apply a schema to an RDD and register table.
>>>
>>> JavaRDD rdds = ...
>>> rdds.cache();
>>>
>>> DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
>>> dataFrame.registerTempTable("person");
>>>
>>> sqlContext.cacheTable("person");
>>>
>>>
>>> Run sql query.
>>>
>>> sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >=
>>> YYY
>>> AND salary <= XXX").collectAsList()
>>>
>>>
>>> I launch standalone cluster which contains 4 workers. Each node runs on
>>> machine with 8 CPU and 15 Gb memory. When I run the query on the
>>> environment
>>> over RDD which contains 1 million persons it takes 1 minute. Somebody can
>>> tell me how to tuning the performance?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Arush Kharbanda
Yes, I am able to reproduce the problem. Do you need the scripts to create
the tables?

On Thu, Apr 16, 2015 at 10:50 PM, Yin Huai  wrote:

> Can your code that can reproduce the problem?
>
> On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
>> Hi
>>
>> As per JIRA this issue is resolved, but i am still facing this issue.
>>
>> SPARK-2734 - DROP TABLE should also uncache table
>>
>>
>> --
>>
>> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>>
>> *Arush Kharbanda* || Technical Teamlead
>>
>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>>
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


[SQL] DROP TABLE should also uncache table

2015-04-16 Thread Arush Kharbanda
Hi

As per JIRA this issue is resolved, but i am still facing this issue.

SPARK-2734 - DROP TABLE should also uncache table


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Checking Data Integrity in Spark

2015-03-27 Thread Arush Kharbanda
It's not possible to configure Spark to run such checks from XML files out of
the box. You would need to write jobs that perform the validations you need.
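
A minimal sketch of such a job, assuming comma-separated input; the column positions, rules, and paths are made up for illustration, and in practice the rule list could be built from your XML/JSON configuration:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("IntegrityCheck"))

  // each rule pairs a column index with a predicate
  val rules: Seq[(Int, String => Boolean)] = Seq(
    (0, v => v.nonEmpty),                              // NOT NULL on column 0
    (2, v => v.matches("""\d{4}-\d{2}-\d{2}"""))       // date format on column 2
  )

  def passes(cols: Array[String]): Boolean =
    rules.forall { case (i, p) => i < cols.length && p(cols(i)) }

  // cached because it is read by two separate save actions below
  val records = sc.textFile("hdfs:///data/input").map(_.split(",", -1)).cache()
  records.filter(passes).map(_.mkString(",")).saveAsTextFile("hdfs:///data/valid")
  records.filter(r => !passes(r)).map(_.mkString(",")).saveAsTextFile("hdfs:///data/rejected")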

On Fri, Mar 27, 2015 at 5:13 PM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Hello,
>
> I want to check if there is any way to check the data integrity of the
> data files. The use case is perform data integrity check on large files
> 100+ columns and reject records (write it another file) that does not meet
> criteria's (such as NOT NULL, date format, etc). Since there are lot of
> columns/integrity rules we should able to data integrity check through
> configurations (like xml, json, etc); Please share your thoughts..
>
>
> Thanks
>
> Sathish
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Can spark sql read existing tables created in hive

2015-03-27 Thread Arush Kharbanda
It seems Spark SQL accesses some additional columns apart from those created by Hive.

You can always recreate the tables (you would need to execute the table
creation scripts), though it would be good to avoid recreating them.

On Fri, Mar 27, 2015 at 3:20 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> I did copy hive-conf.xml form Hive installation into spark-home/conf. IT
> does have all the meta store connection details, host, username, passwd,
> driver and others.
>
>
>
> Snippet
> ==
>
>
> 
>
> 
>   javax.jdo.option.ConnectionURL
>   jdbc:mysql://host.vip.company.com:3306/HDB
> 
>
> 
>   javax.jdo.option.ConnectionDriverName
>   com.mysql.jdbc.Driver
>   Driver class name for a JDBC metastore
> 
>
> 
>   javax.jdo.option.ConnectionUserName
>   hiveuser
>   username to use against metastore database
> 
>
> 
>   javax.jdo.option.ConnectionPassword
>   some-password
>   password to use against metastore database
> 
>
> 
>   hive.metastore.local
>   false
>   controls whether to connect to remove metastore server or
> open a new metastore server in Hive Client JVM
> 
>
> 
>   hive.metastore.warehouse.dir
>   /user/hive/warehouse
>   location of default database for the warehouse
> 
>
> ..
>
>
>
> When i attempt to read hive table, it does not work. dw_bid does not
> exists.
>
> I am sure there is a way to read tables stored in HDFS (Hive) from Spark
> SQL. Otherwise how would anyone do analytics since the source tables are
> always either persisted directly on HDFS or through Hive.
>
>
> On Fri, Mar 27, 2015 at 1:15 PM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
>> Since hive and spark SQL internally use HDFS and Hive metastore. The only
>> thing you want to change is the processing engine. You can try to bring
>> your hive-site.xml to %SPARK_HOME%/conf/hive-site.xml.(Ensure that the hive
>> site xml captures the metastore connection details).
>>
>> Its a hack,  i havnt tried it. I have played around with the metastore
>> and it should work.
>>
>> On Fri, Mar 27, 2015 at 12:04 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
>> wrote:
>>
>>> I have few tables that are created in Hive. I wan to transform data
>>> stored in these Hive tables using Spark SQL. Is this even possible ?
>>>
>>> So far i have seen that i can create new tables using Spark SQL dialect.
>>> However when i run show tables or do desc hive_table it says table not
>>> found.
>>>
>>> I am now wondering is this support present or not in Spark SQL ?
>>>
>>> --
>>> Deepak
>>>
>>>
>>
>>
>> --
>>
>> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>>
>> *Arush Kharbanda* || Technical Teamlead
>>
>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>>
>
>
>
> --
> Deepak
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Can spark sql read existing tables created in hive

2015-03-27 Thread Arush Kharbanda
Since Hive and Spark SQL internally use HDFS and the Hive metastore, the only
thing you want to change is the processing engine. You can try copying your
hive-site.xml to %SPARK_HOME%/conf/hive-site.xml (ensure that the hive-site.xml
captures the metastore connection details).

It's a hack; I haven't tried it myself, but I have played around with the
metastore and it should work.
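
Once the metastore details are picked up, a quick sanity check from spark-shell would look roughly like this (assuming Spark was built with Hive support; the table name is a placeholder):

  import org.apache.spark.sql.hive.HiveContext

  val hiveCtx = new HiveContext(sc)   // sc is the SparkContext that spark-shell provides
  hiveCtx.sql("SHOW TABLES").collect().foreach(println)
  hiveCtx.sql("SELECT * FROM some_hive_table LIMIT 10").collect().foreach(println)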

On Fri, Mar 27, 2015 at 12:04 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> I have few tables that are created in Hive. I wan to transform data stored
> in these Hive tables using Spark SQL. Is this even possible ?
>
> So far i have seen that i can create new tables using Spark SQL dialect.
> However when i run show tables or do desc hive_table it says table not
> found.
>
> I am now wondering is this support present or not in Spark SQL ?
>
> --
> Deepak
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Arush Kharbanda
You can look at the Spark SQL programming guide.
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html

and the Spark API.
http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.package

On Thu, Mar 26, 2015 at 5:21 PM, Masf  wrote:

> Ok,
>
> Thanks. Some web resource where I could check the functionality supported
> by Spark SQL?
>
> Thanks!!!
>
> Regards.
> Miguel Ángel.
>
> On Thu, Mar 26, 2015 at 12:31 PM, Cheng Lian 
> wrote:
>
>>  We're working together with AsiaInfo on this. Possibly will deliver an
>> initial version of window function support in 1.4.0. But it's not a promise
>> yet.
>>
>> Cheng
>>
>> On 3/26/15 7:27 PM, Arush Kharbanda wrote:
>>
>> Its not yet implemented.
>>
>>  https://issues.apache.org/jira/browse/SPARK-1442
>>
>> On Thu, Mar 26, 2015 at 4:39 PM, Masf  wrote:
>>
>>> Hi.
>>>
>>>  Are the Windowing and Analytics functions supported in Spark SQL (with
>>> HiveContext or not)? For example in Hive is supported
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>>>
>>>
>>>  Some tutorial or documentation where I can see all features supported
>>> by Spark SQL?
>>>
>>>
>>>  Thanks!!!
>>> --
>>>
>>>
>>> Regards.
>>> Miguel Ángel
>>>
>>
>>
>>
>>  --
>>
>> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>>
>> *Arush Kharbanda* || Technical Teamlead
>>
>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>>
>>
>>
>
>
> --
>
>
> Saludos.
> Miguel Ángel
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Arush Kharbanda
It's not yet implemented:

https://issues.apache.org/jira/browse/SPARK-1442

On Thu, Mar 26, 2015 at 4:39 PM, Masf  wrote:

> Hi.
>
> Are the Windowing and Analytics functions supported in Spark SQL (with
> HiveContext or not)? For example in Hive is supported
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>
>
> Some tutorial or documentation where I can see all features supported by
> Spark SQL?
>
>
> Thanks!!!
> --
>
>
> Regards.
> Miguel Ángel
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread Arush Kharbanda
er.YarnClusterScheduler: Created
> YarnClusterScheduler
> 15/03/25 11:52:51 ERROR cluster.YarnClusterSchedulerBackend: Application ID
> is not set.
> 15/03/25 11:52:51 INFO netty.NettyBlockTransferService: Server created on
> 49982
> 15/03/25 11:52:51 INFO storage.BlockManagerMaster: Trying to register
> BlockManager
> 15/03/25 11:52:51 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-10-80-175-92.ec2.internal:49982 with 265.4 MB RAM,
> BlockManagerId(, ip-10-80-175-92.ec2.internal, 49982)
> 15/03/25 11:52:51 INFO storage.BlockManagerMaster: Registered BlockManager
> *Exception in thread "main" java.lang.NullPointerException*
> *at
>
> org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:581)*
> at
>
> org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32)
> at org.apache.spark.SparkContext.(SparkContext.scala:541)
> at
>
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:642)
> at org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:75)
> at
>
> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:132)
> at
>
> org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN.main(JavaKinesisWordCountASLYARN.java:127)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
>
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/JavaKinesisWordCountASLYARN-Example-not-working-on-EMR-tp6.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Spark SQL: Day of month from Timestamp

2015-03-24 Thread Arush Kharbanda
Hi

You can use functions like year(date), month(date), and day(date).
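
For example, through a HiveContext the Hive date UDFs can be used directly in SQL; a small sketch (the table and column names are placeholders):

  import org.apache.spark.sql.hive.HiveContext

  val hiveCtx = new HiveContext(sc)   // sc: an existing SparkContext
  hiveCtx.sql("SELECT day(event_time), month(event_time), year(event_time) FROM events")
    .collect()
    .foreach(println)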

Thanks
Arush

On Tue, Mar 24, 2015 at 12:46 PM, Harut Martirosyan <
harut.martiros...@gmail.com> wrote:

> Hi guys.
>
> Basically, we had to define a UDF that does that, is there a built in
> function that we can use for it?
>
> --
> RGRDZ Harut
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Spark sql thrift server slower than hive

2015-03-22 Thread Arush Kharbanda
A basic change needed by Spark is setting the executor memory, which defaults
to 512 MB.
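
For instance, a sketch of raising it programmatically (the value is only illustrative; the same setting can go into conf/spark-defaults.conf or be passed as --executor-memory when starting the Thrift server):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("ThriftServerTuning")      // hypothetical app name
    .set("spark.executor.memory", "4g")    // raise from the 512 MB default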

On Mon, Mar 23, 2015 at 10:16 AM, Denny Lee  wrote:

> How are you running your spark instance out of curiosity?  Via YARN or
> standalone mode?  When connecting Spark thriftserver to the Spark service,
> have you allocated enough memory and CPU when executing with spark?
>
> On Sun, Mar 22, 2015 at 3:39 AM fanooos  wrote:
>
>> We have cloudera CDH 5.3 installed on one machine.
>>
>> We are trying to use spark sql thrift server to execute some analysis
>> queries against hive table.
>>
>> Without any changes in the configurations, we run the following query on
>> both hive and spark sql thrift server
>>
>> *select * from tableName;*
>>
>> The time taken by spark is larger than the time taken by hive which is not
>> supposed to be the like that.
>>
>> The hive table is mapped to json files stored on HDFS directory and we are
>> using *org.openx.data.jsonserde.JsonSerDe* for
>> serialization/deserialization.
>>
>> Why spark takes much more time to execute the query than hive ?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Spark-sql-thrift-server-slower-than-
>> hive-tp22177.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: sparksql native jdbc driver

2015-03-18 Thread Arush Kharbanda
Yes. I have been using Spark SQL from the onset and haven't found any other
server that provides JDBC connectivity to Spark SQL.

On Wed, Mar 18, 2015 at 5:50 PM, sequoiadb 
wrote:

> hey guys,
>
> In my understanding SparkSQL only supports JDBC connection through hive
> thrift server, is this correct?
>
> Thanks
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Idempotent count

2015-03-18 Thread Arush Kharbanda
Hi Binh,

It stores the state as well as the unprocessed data. The state is a subset of
the records that you have aggregated so far.

This provides a good reference for checkpointing.

http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#checkpointing
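
A minimal stateful-counting sketch showing where the checkpoint directory fits in (the socket source stands in for your Kafka input; paths and intervals are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._

  val ssc = new StreamingContext(new SparkConf().setAppName("StatefulCount"), Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/stateful-count")   // holds the metadata plus the saved state

  val words = ssc.socketTextStream("localhost", 9999)    // stand-in for the Kafka source
    .flatMap(_.split(" "))
    .map(w => (w, 1))

  // running count per key, carried across batches via the checkpointed state
  val updateFunc = (values: Seq[Int], state: Option[Long]) => Some(state.getOrElse(0L) + values.sum)
  val counts = words.updateStateByKey(updateFunc)
  counts.print()

  ssc.start()
  ssc.awaitTermination()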


On Wed, Mar 18, 2015 at 12:52 PM, Binh Nguyen Van 
wrote:

> Hi Arush,
>
> Thank you for answering!
> When you say checkpoints hold metadata and Data, what is the Data? Is it
> the Data that is pulled from input source or is it the state?
> If it is state then is it the same number of records that I aggregated
> since beginning or only a subset of it? How can I limit the size of
> state that is kept in checkpoint?
>
> Thank you
> -Binh
>
> On Tue, Mar 17, 2015 at 11:47 PM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
>> Hi
>>
>> Yes spark streaming is capable of stateful stream processing. With or
>> without state is a way of classifying state.
>> Checkpoints hold metadata and Data.
>>
>> Thanks
>>
>>
>> On Wed, Mar 18, 2015 at 4:00 AM, Binh Nguyen Van 
>> wrote:
>>
>>> Hi all,
>>>
>>> I am new to Spark so please forgive me if my questions is stupid.
>>> I am trying to use Spark-Streaming in an application that read data
>>> from a queue (Kafka) and do some aggregation (sum, count..) and
>>> then persist result to an external storage system (MySQL, VoltDB...)
>>>
>>> From my understanding of Spark-Streaming, I can have two ways
>>> of doing aggregation:
>>>
>>>- Stateless: I don't have to keep state and just apply new delta
>>>values to the external system. From my understanding, doing in this way I
>>>may end up with over counting when there is failure and replay.
>>>- Statefull: Use checkpoint to keep state and blindly save new state
>>>to external system. Doing in this way I have correct aggregation result 
>>> but
>>>I have to keep data in two places (state and external system)
>>>
>>> My questions are:
>>>
>>>- Is my understanding of Stateless and Statefull aggregation
>>>correct? If not please correct me!
>>>- For the Statefull aggregation, What does Spark-Streaming keep when
>>>it saves checkpoint?
>>>
>>> Please kindly help!
>>>
>>> Thanks
>>> -Binh
>>>
>>
>>
>>
>> --
>>
>> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>>
>> *Arush Kharbanda* || Technical Teamlead
>>
>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>>
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Idempotent count

2015-03-17 Thread Arush Kharbanda
Hi

Yes, Spark Streaming is capable of stateful stream processing; with or
without state is one way of classifying streaming aggregations.
Checkpoints hold metadata and data.

Thanks


On Wed, Mar 18, 2015 at 4:00 AM, Binh Nguyen Van  wrote:

> Hi all,
>
> I am new to Spark so please forgive me if my questions is stupid.
> I am trying to use Spark-Streaming in an application that read data
> from a queue (Kafka) and do some aggregation (sum, count..) and
> then persist result to an external storage system (MySQL, VoltDB...)
>
> From my understanding of Spark-Streaming, I can have two ways
> of doing aggregation:
>
>- Stateless: I don't have to keep state and just apply new delta
>values to the external system. From my understanding, doing in this way I
>may end up with over counting when there is failure and replay.
>- Statefull: Use checkpoint to keep state and blindly save new state
>to external system. Doing in this way I have correct aggregation result but
>I have to keep data in two places (state and external system)
>
> My questions are:
>
>- Is my understanding of Stateless and Statefull aggregation correct?
>If not please correct me!
>- For the Statefull aggregation, What does Spark-Streaming keep when
>it saves checkpoint?
>
> Please kindly help!
>
> Thanks
> -Binh
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Hive on Spark with Spark as a service on CDH5.2

2015-03-16 Thread Arush Kharbanda
Hive on Spark and accessing a HiveContext from the shell are separate things.

Hive on Spark:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

To access Hive from Spark you need a build with -Phive:

http://spark.apache.org/docs/1.2.1/building-spark.html#building-with-hive-and-jdbc-support
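
For reference, the build command from that page looks roughly like this (the Hadoop profile and version should be changed to match your CDH cluster):

  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package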

On Tue, Mar 17, 2015 at 11:35 AM, anu  wrote:

> *I am not clear if spark sql supports HIve on Spark when spark is run as a
> service in CDH 5.2? *
>
> Can someone please clarify this. If this is possible, how what
> configuration
> changes have I to make to import hive context in spark shell as well as to
> be able to do a spark-submit for the job to be run on the entire cluster.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Hive-on-Spark-with-Spark-as-a-service-on-CDH5-2-tp22091.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Does spark-1.3.0 support the analytic functions defined in Hive, such as row_number, rank

2015-03-16 Thread Arush Kharbanda
You can track the issue here:

https://issues.apache.org/jira/browse/SPARK-1442

It's currently not supported; I guess the test cases are work in progress.


On Mon, Mar 16, 2015 at 12:44 PM, hseagle  wrote:

> Hi all,
>
>  I'm wondering whether the latest spark-1.3.0 supports the windowing
> and
> analytic funtions in hive, such as row_number, rank and etc.
>
>  Indeed, I've done some testing by using spark-shell and found that
> row_number is not supported yet.
>
>  But I still found that there were some test case related to row_number
> and other analytics functions. These test cases is defined in
>
> sql/hive/target/scala-2.10/test-classes/ql/src/test/queries/clientpositive/windowing_multipartitioning.q
>
> So my question is divided into two parts, One is whether the analytics
> function is supported or not, the othere one is that if it's not supported
> why there are still some test cases
>
> hseagle
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-spark-1-3-0-support-the-analytic-functions-defined-in-Hive-such-as-row-number-rank-tp22072.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: can not submit job to spark in windows

2015-03-12 Thread Arush Kharbanda
essfully started service 'HTTP file
> server' on
>  port 56641.
> 15/02/26 18:21:41 INFO Utils: Successfully started service 'SparkUI' on
> port
> 404
> 0.
> 15/02/26 18:21:41 INFO SparkUI: Started SparkUI at http://mypc:4040
> 15/02/26 18:21:42 INFO Utils: Copying
> C:\spark-1.2.1-bin-hadoop2.4\bin\pi.py
> to
>
> C:\Users\sergun\AppData\Local\Temp\spark-76a21028-ccce-4308-9e70-09c3cfa76477\
> spark-56b32155-2779-4345-9597-2bfa6a87a51d\pi.py
> Traceback (most recent call last):
>   File "C:/spark-1.2.1-bin-hadoop2.4/bin/pi.py", line 29, in 
> sc = SparkContext(appName="PythonPi")
>   File "C:\spark-1.2.1-bin-hadoop2.4\python\pyspark\context.py", line 105,
> in __
> init__
> conf, jsc)
>   File "C:\spark-1.2.1-bin-hadoop2.4\python\pyspark\context.py", line 153,
> in _d
> o_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "C:\spark-1.2.1-bin-hadoop2.4\python\pyspark\context.py", line 202,
> in _i
> nitialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File
> "C:\spark-1.2.1-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_g
> ateway.py", line 701, in __call__
>   File
> "C:\spark-1.2.1-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protoc
> ol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> None.org.apache.spa
> rk.api.java.JavaSparkContext.
> : java.lang.NullPointerException
> at java.lang.ProcessBuilder.start(Unknown Source)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
> at org.apache.hadoop.util.Shell.run(Shell.java:418)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
> 650)
> at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
> at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:445)
> at org.apache.spark.SparkContext.addFile(SparkContext.scala:1004)
> at
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:28
> 8)
> at
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:28
> 8)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.SparkContext.(SparkContext.scala:288)
> at
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.sc
> ala:61)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
> Source)
>
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Sou
> rce)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:214)
> at
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand
> .java:79)
> at
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Unknown Source)
>
> What is wrong on my side?
>
> Should I run some scripts before spark-submit.cmd?
>
> Regards,
> Sergey.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/can-not-submit-job-to-spark-in-windows-tp21824.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Stand-alone Spark on windows

2015-03-12 Thread Arush Kharbanda
Hi

Can you share the complete stack trace?

Thanks
Arush

On Thu, Feb 26, 2015 at 6:09 PM, Sergey Gerasimov  wrote:

> Hi!
>
> I downloaded Spark binaries unpacked and could successfully run pyspark
> shell and write and execute some code here
>
> BUT
>
> I failed with submitting stand-alone python scripts or jar files via
> spark-submit:
> spark-submit pi.py
>
> I always get exception stack trace with NullPointerException in
> java.lang.ProcessBuilder.start().
>
> What could be wrong?
>
> Should I start some scripts before spark-submit?
>
> I have windows 7 and spark 1.2.1
>
> Sergey.
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: How to read from hdfs using spark-shell in Intel hadoop?

2015-03-11 Thread Arush Kharbanda
You can add resolvers in SBT using:

  resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"


On Thu, Feb 26, 2015 at 4:09 PM, MEETHU MATHEW 
wrote:

> Hi,
>
> I am not able to read from HDFS(Intel distribution hadoop,Hadoop version
> is 1.0.3) from spark-shell(spark version is 1.2.1). I built spark using the
> command
> mvn -Dhadoop.version=1.0.3 clean package and started  spark-shell and read
> a HDFS file using sc.textFile() and the exception is
>
>  WARN hdfs.DFSClient: Failed to connect to /10.88.6.133:50010, add to
> deadNodes and continuejava.net.SocketTimeoutException: 12 millis
> timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.88.6.131:44264
> remote=/10.88.6.133:50010]
>
> The same problem is asked in the this mail.
>  RE: Spark is unable to read from HDFS
> <http://mail-archives.us.apache.org/mod_mbox/spark-user/201309.mbox/%3cf97adee4fba8f6478453e148fe9e2e8d3cca3...@hasmsx106.ger.corp.intel.com%3E>
>
>
>
>
>
>
> RE: Spark is unable to read from HDFS
> <http://mail-archives.us.apache.org/mod_mbox/spark-user/201309.mbox/%3cf97adee4fba8f6478453e148fe9e2e8d3cca3...@hasmsx106.ger.corp.intel.com%3E>
> Hi, Thanks for the reply. I've tried the below.
> View on mail-archives.us.apache.org
> <http://mail-archives.us.apache.org/mod_mbox/spark-user/201309.mbox/%3cf97adee4fba8f6478453e148fe9e2e8d3cca3...@hasmsx106.ger.corp.intel.com%3E>
> Preview by Yahoo
>
>
> As suggested in the above mail,
> *"In addition to specifying HADOOP_VERSION=1.0.3 in the
> ./project/SparkBuild.scala file, you will need to specify the
> libraryDependencies and name "spark-core"  resolvers. Otherwise, sbt will
> fetch version 1.0.3 of hadoop-core from apache instead of Intel. You can
> set up your own local or remote repository that you specify" *
>
> Now HADOOP_VERSION is deprecated and -Dhadoop.version should be used. Can
> anybody please elaborate on how to specify tat SBT should fetch hadoop-core
> from Intel which is in our internal repository?
>
> Thanks & Regards,
> Meethu M
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: TreeNodeException: Unresolved attributes

2015-03-04 Thread Arush Kharbanda
Why don't you build the query string first (by appending/interpolating the
strings) before passing it in? Also, the hql function is deprecated; you should
use sql instead.

http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
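
For example, a sketch of building the query string first and running it through sql() (hiveCtx is the HiveContext from your snippet; the table, column, and search term are placeholders):

  val searchTerm = "SomeString"
  val query = s"SELECT * FROM TableName WHERE column LIKE '%$searchTerm%'"
  val restaurants = hiveCtx.sql(query)
  restaurants.collect().foreach(println)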

On Wed, Mar 4, 2015 at 6:15 AM, Anusha Shamanur 
wrote:

> Hi,
>
>
> I am trying to run a simple select query on a table.
>
>
> val restaurants=hiveCtx.hql("select * from TableName where column like
> '%SomeString%' ")
>
> This gives an error as below:
>
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
> attributes: *, tree:
>
> How do I solve this?
>
>
> --
> Regards,
> Anusha
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Unable to submit spark job to mesos cluster

2015-03-04 Thread Arush Kharbanda
ala:104)
>> at
>> org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
>> at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
>> at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
>> at
>> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1454)
>> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1450)
>> at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:156)
>> at org.apache.spark.SparkContext.(SparkContext.scala:203)
>> at
>> com.algofusion.reconciliation.execution.utils.ExecutionUtils.(ExecutionUtils.java:130)
>> at
>> com.algofusion.reconciliation.execution.ReconExecutionController.initialize(ReconExecutionController.java:257)
>> at
>> com.algofusion.reconciliation.execution.ReconExecutionController.main(ReconExecutionController.java:105)
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>> [1 milliseconds]
>> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>> at
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>> at scala.concurrent.Await$.result(package.scala:107)
>> at akka.remote.Remoting.start(Remoting.scala:173)
>> ... 18 more
>> ]
>> Exception in thread "main" java.lang.ExceptionInInitializerError
>> at
>> com.algofusion.reconciliation.execution.ReconExecutionController.initialize(ReconExecutionController.java:257)
>> at
>> com.algofusion.reconciliation.execution.ReconExecutionController.main(ReconExecutionController.java:105)
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>> [1 milliseconds]
>> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>> at
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>> at scala.concurrent.Await$.result(package.scala:107)
>> at akka.remote.Remoting.start(Remoting.scala:173)
>> at
>> akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
>> at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
>> at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
>> at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
>> at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
>> at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
>> at
>> org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
>> at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
>> at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
>> at
>> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1454)
>> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1450)
>> at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:156)
>> at org.apache.spark.SparkContext.(SparkContext.scala:203)
>> at
>> com.algofusion.reconciliation.execution.utils.ExecutionUtils.(ExecutionUtils.java:130)
>> ... 2 more
>>
>> Regards,
>> Sarath.
>>
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Connecting a PHP/Java applications to Spark SQL Thrift Server

2015-03-04 Thread Arush Kharbanda
For Java you can use the hive-jdbc connectivity jars to connect to Spark SQL.

The driver class is inside the hive-jdbc jar:

http://hive.apache.org/javadocs/r0.11.0/api/org/apache/hadoop/hive/jdbc/HiveDriver.html
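
A minimal JDBC sketch against the Spark SQL Thrift server (host, port, credentials, and the table name are placeholders; the Thrift server speaks the HiveServer2 protocol, so the org.apache.hive.jdbc.HiveDriver class from the hive-jdbc jar is the one to load):

  import java.sql.DriverManager

  Class.forName("org.apache.hive.jdbc.HiveDriver")

  // 10000 is the default Thrift server port
  val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
  val stmt = conn.createStatement()
  val rs = stmt.executeQuery("SELECT count(*) FROM some_table")
  while (rs.next()) println(rs.getLong(1))
  rs.close(); stmt.close(); conn.close()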




On Wed, Mar 4, 2015 at 1:26 PM,  wrote:

> SparkSQL supports JDBC/ODBC connectivity, so if that's the route you
> needed/wanted to connect through you could do so via java/php apps.  Havent
> used either so cant speak to the developer experience, assume its pretty
> good as would be preferred method for lots of third party enterprise
> apps/tooling
>
> If you prefer using the thrift server/interface, if they don't exist
> already
> in open source land you can use thrift definitions to generate client libs
> in any supported thrift language and use that for connectivity.  Seems one
> issue with thrift-server is when running in cluster mode.  Seems like it
> still exists but UX of error has been cleaned up in 1.3:
>
> https://issues.apache.org/jira/browse/SPARK-5176
>
>
>
> -Original Message-
> From: fanooos [mailto:dev.fano...@gmail.com]
> Sent: Tuesday, March 3, 2015 11:15 PM
> To: user@spark.apache.org
> Subject: Connecting a PHP/Java applications to Spark SQL Thrift Server
>
> We have installed hadoop cluster with hive and spark and the spark sql
> thrift server is up and running without any problem.
>
> Now we have set of applications need to use spark sql thrift server to
> query
> some data.
>
> Some of these applications are java applications and the others are PHP
> applications.
>
> As I am an old fashioned java developer, I used to connect java
> applications
> to BD servers like Mysql using a JDBC driver. Is there a corresponding
> driver for connecting with Spark Sql Thrift server ? Or what is the library
> I need to use to connect to it?
>
>
> For PHP, what are the ways we can use to connect PHP applications to Spark
> Sql Thrift Server?
>
>
>
>
>
> --
> View this message in context:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Connecting-a-PHP-Java-ap
> plications-to-Spark-SQL-Thrift-Server-tp21902.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: job keeps failing with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

2015-02-27 Thread Arush Kharbanda
Can you share the error you are getting when the job fails?

On Thu, Feb 26, 2015 at 4:32 AM, Darin McBeath 
wrote:

> I'm using Spark 1.2, stand-alone cluster on ec2 I have a cluster of 8
> r3.8xlarge machines but limit the job to only 128 cores.  I have also tried
> other things such as setting 4 workers per r3.8xlarge and 67gb each but
> this made no difference.
>
> The job frequently fails at the end in this step (saveasHadoopFile).   It
> will sometimes work.
>
> finalNewBaselinePairRDD is hashPartitioned with 1024 partitions and a
> total size around 1TB.  There are about 13.5M records in
> finalNewBaselinePairRDD.  finalNewBaselinePairRDD is 
>
>
> JavaPairRDD finalBaselineRDDWritable =
> finalNewBaselinePairRDD.mapToPair(new
> ConvertToWritableTypes()).persist(StorageLevel.MEMORY_AND_DISK_SER());
>
> // Save to hdfs (gzip)
> finalBaselineRDDWritable.saveAsHadoopFile("hdfs:///sparksync/",
> Text.class, Text.class,
> SequenceFileOutputFormat.class,org.apache.hadoop.io.compress.GzipCodec.class);
>
>
> If anyone has any tips for what I should look into it would be appreciated.
>
> Thanks.
>
> Darin.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: How to pass a org.apache.spark.rdd.RDD in a recursive function

2015-02-27 Thread Arush Kharbanda
Passing RDDs around like this is not a good idea. RDDs are immutable and can't
be changed inside functions. Have you considered taking a different approach?

On Thu, Feb 26, 2015 at 3:42 AM, dritanbleco  wrote:

> Hello
>
> i am trying to pass as a parameter a org.apache.spark.rdd.RDD table to a
> recursive function. This table should be changed in any step of the
> recursion and could not be just a global var
>
> need help :)
>
> Thank you
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-pass-a-org-apache-spark-rdd-RDD-in-a-recursive-function-tp21805.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: On app upgrade, restore sliding window data.

2015-02-24 Thread Arush Kharbanda
I think this could be of some help to you.

https://issues.apache.org/jira/browse/SPARK-3660



On Tue, Feb 24, 2015 at 2:18 AM, Matus Faro  wrote:

> Hi,
>
> Our application is being designed to operate at all times on a large
> sliding window (day+) of data. The operations performed on the window
> of data will change fairly frequently and I need a way to save and
> restore the sliding window after an app upgrade without having to wait
> the duration of the sliding window to "warm up". Because it's an app
> upgrade, checkpointing will not work unfortunately.
>
> I can potentially dump the window to an outside storage periodically
> or on app shutdown, but I don't have an ideal way of restoring it.
>
> I thought about two non-ideal solutions:
> 1. Load the previous data all at once into the sliding window on app
> startup. The problem is, at one point I will have double the data in
> the sliding window until the initial batch of data goes out of scope.
> 2. Broadcast the previous state of the window separately from the
> window. Perform the operations on both sets of data until it comes out
> of scope. The problem is, the data will not fit into memory.
>
> Solutions that would solve my problem:
> 1. Ability to pre-populate sliding window.
> 2. Have control over batch slicing. It would be nice for a Receiver to
> dictate the current batch timestamp in order to slow down or fast
> forward time.
>
> Any feedback would be greatly appreciated!
>
> Thank you,
> Matus
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: updateStateByKey and invFunction

2015-02-24 Thread Arush Kharbanda
You can use reduceByKeyAndWindow with your specific window and slide
durations; it lets you specify an inverse function so the counts are updated
incrementally as batches enter and leave the window.
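
A sketch for the top-pages case from the question (the socket source stands in for your real stream of (page, user) events; durations and paths are placeholders, and a checkpoint directory is required when the inverse function is used):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Minutes, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._

  val ssc = new StreamingContext(new SparkConf().setAppName("TopUsersPerPage"), Minutes(5))
  ssc.checkpoint("hdfs:///checkpoints/top-users")

  // stand-in source: each line is "page user"
  val visits = ssc.socketTextStream("localhost", 9999)
    .map(_.split(" "))
    .map(a => ((a(0), a(1)), 1))

  // rolling 2-hour window recomputed every 5 minutes; the inverse function subtracts the batch leaving the window
  val windowedCounts = visits.reduceByKeyAndWindow(_ + _, _ - _, Minutes(120), Minutes(5))
  windowedCounts.print()

  ssc.start()
  ssc.awaitTermination()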

On Tue, Feb 24, 2015 at 1:36 PM, Ashish Sharma 
wrote:

> So say I want to calculate top K users visiting a page in the past 2 hours
> updated every 5 mins.
>
> so here I want to maintain something like this
>
> Page_01 => {user_01:32, user_02:3, user_03:7...}
> ...
>
> Basically a count of number of times a user visited a page. Here my key is
> page name/id and state is the hashmap.
>
> Now in updateStateByKey I get the previous state and new events coming
> *in* the window. Is there a way to also get the events going *out* of the
> window? This was I can incrementally update the state over a rolling window.
>
> What is the efficient way to do it in spark streaming?
>
> Thanks
> Ashish
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: How to start spark-shell with YARN?

2015-02-23 Thread Arush Kharbanda
Hi

Are you sure that you built Spark for YARN? Even if standalone works, the
build may not include YARN support.

Thanks
Arush

On Tue, Feb 24, 2015 at 12:06 PM, Xi Shen  wrote:

> Hi,
>
> I followed this guide,
> http://spark.apache.org/docs/1.2.1/running-on-yarn.html, and tried to
> start spark-shell with yarn-client
>
> ./bin/spark-shell --master yarn-client
>
>
> But I got
>
> WARN ReliableDeliverySupervisor: Association with remote system 
> [akka.tcp://sparkYarnAM@10.0.2.15:38171] has failed, address is now gated for 
> [5000] ms. Reason is: [Disassociated].
>
> In the spark-shell, and other exceptions in they yarn log. Please see
> http://stackoverflow.com/questions/28671171/spark-shell-cannot-connect-to-yarn
> for more detail.
>
>
> However, submitting to the this cluster works. Also, spark-shell as
> standalone works.
>
>
> My system:
>
> - ubuntu amd64
> - spark 1.2.1
> - yarn from hadoop 2.6 stable
>
>
> Thanks,
>
> [image: --]
> Xi Shen
> [image: http://]about.me/davidshen
> <http://about.me/davidshen?promo=email_sig>
>   <http://about.me/davidshen>
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Spark and Spark Streaming code sharing best practice.

2015-02-18 Thread Arush Kharbanda
Monoids are useful for aggregations. Also try to avoid anonymous functions;
pulling named functions out of the Spark code allows them to be reused
(possibly between Spark and Spark Streaming).
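
A small sketch of the idea (all names are illustrative): the aggregation lives in one named function over an RDD, and the same function is applied to a batch RDD directly or to a DStream through transform:

  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD

  // shared logic, written once against RDD[String]
  def countWords(lines: RDD[String]): RDD[(String, Long)] =
    lines.flatMap(_.split(" ")).map(w => (w, 1L)).reduceByKey(_ + _)

  // batch usage:      countWords(sc.textFile("hdfs:///data/batch")).saveAsTextFile("hdfs:///out/batch")
  // streaming usage:  linesStream.transform(countWords _).print()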

On Thu, Feb 19, 2015 at 6:56 AM, Jean-Pascal Billaud 
wrote:

> Thanks Arush. I will check that out.
>
> On Wed, Feb 18, 2015 at 11:06 AM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
>> I find monoids pretty useful in this respect, basically separating out
>> the logic in a monoid and then applying the logic to either a stream or a
>> batch. A list of such practices could be really useful.
>>
>> On Thu, Feb 19, 2015 at 12:26 AM, Jean-Pascal Billaud 
>> wrote:
>>
>>> Hey,
>>>
>>> It seems pretty clear that one of the strength of Spark is to be able to
>>> share your code between your batch and streaming layer. Though, given that
>>> Spark streaming uses DStream being a set of RDDs and Spark uses a single
>>> RDD there might some complexity associated with it.
>>>
>>> Of course since DStream is a superset of RDDs, one can just run the same
>>> code at the RDD granularity using DStream::forEachRDD. While this should
>>> work for map, I am not sure how that can work when it comes to reduce phase
>>> given that a group of keys spans across multiple RDDs.
>>>
>>> One of the option is to change the dataset object on which a job works
>>> on. For example of passing an RDD to a class method, one passes a higher
>>> level object (MetaRDD) that wraps around RDD or DStream depending the
>>> context. At this point the job calls its regular maps, reduces and so on
>>> and the MetaRDD wrapper would delegate accordingly.
>>>
>>> Just would like to know the official best practice from the spark
>>> community though.
>>>
>>> Thanks,
>>>
>>
>>
>>
>> --
>>
>> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>>
>> *Arush Kharbanda* || Technical Teamlead
>>
>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>>
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Spark and Spark Streaming code sharing best practice.

2015-02-18 Thread Arush Kharbanda
I find monoids pretty useful in this respect, basically separating out the
logic in a monoid and then applying the logic to either a stream or a
batch. A list of such practices could be really useful.

On Thu, Feb 19, 2015 at 12:26 AM, Jean-Pascal Billaud 
wrote:

> Hey,
>
> It seems pretty clear that one of the strength of Spark is to be able to
> share your code between your batch and streaming layer. Though, given that
> Spark streaming uses DStream being a set of RDDs and Spark uses a single
> RDD there might some complexity associated with it.
>
> Of course since DStream is a superset of RDDs, one can just run the same
> code at the RDD granularity using DStream::forEachRDD. While this should
> work for map, I am not sure how that can work when it comes to reduce phase
> given that a group of keys spans across multiple RDDs.
>
> One of the option is to change the dataset object on which a job works on.
> For example of passing an RDD to a class method, one passes a higher level
> object (MetaRDD) that wraps around RDD or DStream depending the context. At
> this point the job calls its regular maps, reduces and so on and the
> MetaRDD wrapper would delegate accordingly.
>
> Just would like to know the official best practice from the spark
> community though.
>
> Thanks,
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: How to integrate hive on spark

2015-02-18 Thread Arush Kharbanda
Hi

Did you try these steps?


https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

Thanks
Arush

On Wed, Feb 18, 2015 at 7:20 PM, sandeepvura  wrote:

> Hi ,
>
> I am new to sparks.I had installed spark on 3 node cluster.I would like to
> integrate hive on spark .
>
> can anyone please help me on this,
>
> Regards,
> Sandeep.v
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-integrate-hive-on-spark-tp21702.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: How to connect a mobile app (Android/iOS) with a Spark backend?

2015-02-18 Thread Arush Kharbanda
I am running Spark jobs behind Tomcat. We didn't face any issues, but for us
the user base is very small.

The possible blockers could be:
1. If there are many users of the system, jobs might have to wait; you
might want to think about the kind of scheduling you want to do (see the
sketch below).
2. Again, if the number of users is a bit high, Tomcat doesn't scale really
well (not sure how much of a blocker that is).
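
A minimal sketch of enabling fair scheduling inside such a long-running driver,
so jobs from different users do not queue strictly behind each other; the app
and pool names are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}

// FAIR scheduling lets jobs submitted from different request threads share
// the executors instead of queuing behind each other (the default is FIFO).
val conf = new SparkConf()
  .setAppName("rest-backed-driver")        // illustrative name
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Optionally tag the jobs of one web request with a named pool.
sc.setLocalProperty("spark.scheduler.pool", "webRequests")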

Thanks
Arush

On Wed, Feb 18, 2015 at 6:41 PM, Ralph Bergmann | the4thFloor.eu <
ra...@the4thfloor.eu> wrote:

> Hi,
>
>
> I have dependency problems to use spark-core inside of a HttpServlet
> (see other mail from me).
>
> Maybe I'm wrong?!
>
> What I want to do: I develop a mobile app (Android and iOS) and want to
> connect them with Spark on backend side.
>
> To do this I want to use Tomcat. The app uses https to ask Tomcat for
> the needed data and Tomcat asks Spark.
>
> Is this the right way? Or is there a better way to connect my mobile
> apps with the Spark backend?
>
> I hope that I'm not the first one who want to do this.
>
>
>
> Ralph
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Cannot access Spark web UI

2015-02-18 Thread Arush Kharbanda
ler$CachedChain.doFilter(ServletHandler.java:1212)
>> at
>> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1223)
>> at
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>> at
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>> at
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> at
>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>> at
>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>> at
>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>> at
>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:767)
>> at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>> at
>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>> at
>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>> at org.mortbay.jetty.Server.handle(Server.java:326)
>> at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>> at
>> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>> at
>> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>> at
>> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>>
>> Powered by Jetty://
>>
>> --
>> Thanks & Regards,
>>
>> *Mukesh Jha *
>>
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: JsonRDD to parquet -- data loss

2015-02-17 Thread Arush Kharbanda
I am not sure if this is the easiest way to solve your problem, but you can
connect to the Hive metastore (through Derby) and find the HDFS path from
there.

On Wed, Feb 18, 2015 at 9:31 AM, Vasu C  wrote:

> Hi,
>
> I am running spark batch processing job using spark-submit command. And
> below is my code snippet.  Basically converting JsonRDD to parquet and
> storing it in HDFS location.
>
> The problem I am facing is if multiple jobs are are triggered parallely,
> even though job executes properly (as i can see in spark webUI), there is
> no parquet file created in hdfs path. If 5 jobs are executed parallely than
> only 3 parquet files are getting created.
>
> Is this the data loss scenario ? Or am I missing something here. Please
> help me in this
>
> Here tableName is unique with timestamp appended to it.
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
> val jsonRdd  = sqlContext.jsonRDD(results)
>
> val parquetTable = sqlContext.parquetFile(parquetFilePath)
>
> parquetTable.registerTempTable(tableName)
>
> jsonRdd.insertInto(tableName)
>
>
> Regards,
>
>   Vasu C
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: spark-core in a servlet

2015-02-17 Thread Arush Kharbanda
I am not sure if this is what is causing the issue, but Spark is compatible
with Scala 2.10.
Instead of spark-core_2.11 you might want to try spark-core_2.10.

On Wed, Feb 18, 2015 at 5:44 AM, Ralph Bergmann | the4thFloor.eu <
ra...@the4thfloor.eu> wrote:

> Hi,
>
>
> I want to use spark-core inside of a HttpServlet. I use Maven for the
> build task but I have a dependency problem :-(
>
> I get this error message:
>
> ClassCastException:
>
> com.sun.jersey.server.impl.container.servlet.JerseyServletContainerInitializer
> cannot be cast to javax.servlet.ServletContainerInitializer
>
> When I add this exclusions it builds but than there are other classes
> not found at runtime:
>
>   
>  org.apache.spark
>  spark-core_2.11
>  1.2.1
>  
> 
>org.apache.hadoop
>hadoop-client
> 
> 
>org.eclipse.jetty
>*
> 
>  
>   
>
>
> What can I do?
>
>
> Thanks a lot!,
>
> Ralph
>
> --
>
> Ralph Bergmann
>
> iOS and Android app developer
>
>
> www  http://www.the4thFloor.eu
>
> mail ra...@the4thfloor.eu
> skypedasralph
>
> google+  https://plus.google.com/+RalphBergmann
> xing https://www.xing.com/profile/Ralph_Bergmann3
> linkedin https://www.linkedin.com/in/ralphbergmann
> gulp https://www.gulp.de/Profil/RalphBergmann.html
> github   https://github.com/the4thfloor
>
>
> pgp key id   0x421F9B78
> pgp fingerprint  CEE3 7AE9 07BE 98DF CD5A E69C F131 4A8E 421F 9B78
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-17 Thread Arush Kharbanda
Hi

Did you try making Maven pick the latest version via dependency management?

http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Management

That way SolrJ won't cause any issue. You can try this and check whether the
part of your code that accesses HDFS still works fine.
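
A rough sketch of that approach, assuming HttpClient 4.3.1 is the version
SolrJ actually needs (check mvn dependency:tree to confirm before pinning):

<dependencyManagement>
  <dependencies>
    <!-- Pin one HttpClient version for the whole transitive tree -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.3.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>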



On Wed, Feb 18, 2015 at 10:23 AM, dgoldenberg 
wrote:

> I'm getting the below error when running spark-submit on my class. This
> class
> has a transitive dependency on HttpClient v.4.3.1 since I'm calling SolrJ
> 4.10.3 from within the class.
>
> This is in conflict with the older version, HttpClient 3.1 that's a
> dependency of Hadoop 2.4 (I'm running Spark 1.2.1 built for Hadoop 2.4).
>
> I've tried setting spark.files.userClassPathFirst to true in SparkConf in
> my
> program, also setting it to true in  $SPARK-HOME/conf/spark-defaults.conf
> as
>
> spark.files.userClassPathFirst true
>
> No go, I'm still getting the error, as below. Is there anything else I can
> try? Are there any plans in Spark to support multiple class loaders?
>
> Exception in thread "main" java.lang.NoSuchMethodError:
>
> org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/conn/scheme/SchemeRegistry;
> at
>
> org.apache.http.impl.client.SystemDefaultHttpClient.createClientConnectionManager(SystemDefaultHttpClient.java:121)
> at
>
> org.apache.http.impl.client.AbstractHttpClient.getConnectionManager(AbstractHttpClient.java:445)
> at
>
> org.apache.solr.client.solrj.impl.HttpClientUtil.setMaxConnections(HttpClientUtil.java:206)
> at
>
> org.apache.solr.client.solrj.impl.HttpClientConfigurer.configure(HttpClientConfigurer.java:35)
> at
>
> org.apache.solr.client.solrj.impl.HttpClientUtil.configureClient(HttpClientUtil.java:142)
> at
>
> org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:118)
> at
>
> org.apache.solr.client.solrj.impl.HttpSolrServer.(HttpSolrServer.java:168)
> at
>
> org.apache.solr.client.solrj.impl.HttpSolrServer.(HttpSolrServer.java:141)
> ...
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Class-loading-issue-spark-files-userClassPathFirst-doesn-t-seem-to-be-working-tp21693.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -----
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Arush Kharbanda
Hi

How big is your dataset?

Thanks
Arush

On Tue, Feb 17, 2015 at 4:06 PM, Julaiti Alafate 
wrote:

> Thank you very much for your reply!
>
> My task is to count the number of word pairs in a document. If w1 and w2
> occur together in one sentence, the number of occurrence of word pair (w1,
> w2) adds 1. So the computational part of this algorithm is simply a
> two-level for-loop.
>
> Since the cluster is monitored by Ganglia, I can easily see that neither
> CPU or network IO is under pressure. The only parameter left is memory. In
> the "executor" tab of Spark Web UI, I can see a column named "memory used".
> It showed that only 6GB of 20GB memory is used. I understand this is
> measuring the size of RDD that persist in memory. So can I at least assume
> the data/object I used in my program is not exceeding memory limit?
>
> My confusion here is, why can't my program run faster while there is still
> efficient memory, CPU time and network bandwidth it can utilize?
>
> Best regards,
> Julaiti
>
>
> On Tue, Feb 17, 2015 at 12:53 AM, Akhil Das 
> wrote:
>
>> What application are you running? Here's a few things:
>>
>> - You will hit bottleneck on CPU if you are doing some complex
>> computation (like parsing a json etc.)
>> - You will hit bottleneck on Memory if your data/objects used in the
>> program is large (like defining playing with HashMaps etc inside your map*
>> operations), Here you can set spark.executor.memory to a higher number and
>> also you can change the spark.storage.memoryFraction whose default value is
>> 0.6 of your executor memory.
>> - Network will be a bottleneck if data is not available locally on one of
>> the worker and hence it has to collect it from others, which is a lot of
>> Serialization and data transfer across your cluster.
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Feb 17, 2015 at 11:20 AM, Julaiti Alafate 
>> wrote:
>>
>>> Hi there,
>>>
>>> I am trying to scale up the data size that my application is handling.
>>> This application is running on a cluster with 16 slave nodes. Each slave
>>> node has 60GB memory. It is running in standalone mode. The data is coming
>>> from HDFS that also in same local network.
>>>
>>> In order to have an understanding on how my program is running, I also
>>> had a Ganglia installed on the cluster. From previous run, I know the stage
>>> that taking longest time to run is counting word pairs (my RDD consists of
>>> sentences from a corpus). My goal is to identify the bottleneck of my
>>> application, then modify my program or hardware configurations according to
>>> that.
>>>
>>> Unfortunately, I didn't find too much information on Spark monitoring
>>> and optimization topics. Reynold Xin gave a great talk on Spark Summit 2014
>>> for application tuning from tasks perspective. Basically, his focus is on
>>> tasks that oddly slower than the average. However, it didn't solve my
>>> problem because there is no such tasks that run way slow than others in my
>>> case.
>>>
>>> So I tried to identify the bottleneck from hardware prospective. I want
>>> to know what the limitation of the cluster is. I think if the executers are
>>> running hard, either CPU, memory or network bandwidth (or maybe the
>>> combinations) is hitting the roof. But Ganglia reports the CPU utilization
>>> of cluster is no more than 50%, network utilization is high for several
>>> seconds at the beginning, then drop close to 0. From Spark UI, I can see
>>> the nodes with maximum memory usage is consuming around 6GB, while
>>> "spark.executor.memory" is set to be 20GB.
>>>
>>> I am very confused that the program is not running fast enough, while
>>> hardware resources are not in shortage. Could you please give me some hints
>>> about what decides the performance of a Spark application from hardware
>>> perspective?
>>>
>>> Thanks!
>>>
>>> Julaiti
>>>
>>>
>>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Configration Problem? (need help to get Spark job executed)

2015-02-17 Thread Arush Kharbanda
n
> worker-20150214102534-devpeng-db-cassandra-2.devpeng
> (devpeng-db-cassandra-2.devpeng.x:57563) with 8 cores
> 15/02/14 10:30:13 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20150214103013-0001/0 on hostPort
> devpeng-db-cassandra-2.devpeng.:57563 with 8 cores, 512.0 MB RAM
> 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor added:
> app-20150214103013-0001/1 on
> worker-20150214102534-devpeng-db-cassandra-3.devpeng.-38773
> (devpeng-db-cassandra-3.devpeng.xx:38773) with 8 cores
> 15/02/14 10:30:13 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20150214103013-0001/1 on hostPort
> devpeng-db-cassandra-3.devpeng.xe:38773 with 8 cores, 512.0 MB RAM
> 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated:
> app-20150214103013-0001/0 is now LOADING
> 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated:
> app-20150214103013-0001/1 is now LOADING
> 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated:
> app-20150214103013-0001/0 is now RUNNING
> 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated:
> app-20150214103013-0001/1 is now RUNNING
> 15/02/14 10:30:13 INFO NettyBlockTransferService: Server created on 58200
> 15/02/14 10:30:13 INFO BlockManagerMaster: Trying to register BlockManager
> 15/02/14 10:30:13 INFO BlockManagerMasterActor: Registering block manager
> 192.168.2.103:58200 with 530.3 MB RAM, BlockManagerId(, ,
> 58200)
> 15/02/14 10:30:13 INFO BlockManagerMaster: Registered BlockManager
> 15/02/14 10:30:13 INFO SparkDeploySchedulerBackend: SchedulerBackend is
> ready for scheduling beginning after reached minRegisteredResourcesRatio:
> 0.0
> 15/02/14 10:30:14 INFO Cluster: New Cassandra host
> devpeng-db-cassandra-1.devpeng.gkh-setu.de/:9042 added
> 15/02/14 10:30:14 INFO Cluster: New Cassandra host /xxx:9042 added
> 15/02/14 10:30:14 INFO Cluster: New Cassandra host :9042 added
> 15/02/14 10:30:14 INFO CassandraConnector: Connected to Cassandra cluster:
> GKHDevPeng
> 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host xxx
> (DC1)
> 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host xxx
> (DC1)
> 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host 
> (DC1)
> 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host
> x (DC1)
> 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host
> x (DC1)
> 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host
> x (DC1)
> 15/02/14 10:30:15 INFO CassandraConnector: Disconnected from Cassandra
> cluster: GKHDevPeng
> 15/02/14 10:30:16 INFO SparkContext: Starting job: count at
> TestApp.scala:23
> 15/02/14 10:30:16 INFO DAGScheduler: Got job 0 (count at TestApp.scala:23)
> with 3 output partitions (allowLocal=false)
> 15/02/14 10:30:16 INFO DAGScheduler: Final stage: Stage 0(count at
> TestApp.scala:23)
> 15/02/14 10:30:16 INFO DAGScheduler: Parents of final stage: List()
> 15/02/14 10:30:16 INFO DAGScheduler: Missing parents: List()
> 15/02/14 10:30:16 INFO DAGScheduler: Submitting Stage 0 (CassandraRDD[0]
> at RDD at CassandraRDD.scala:49), which has no missing parents
> 15/02/14 10:30:16 INFO MemoryStore: ensureFreeSpace(4472) called with
> curMem=0, maxMem=556038881
> 15/02/14 10:30:16 INFO MemoryStore: Block broadcast_0 stored as values in
> memory (estimated size 4.4 KB, free 530.3 MB)
> 15/02/14 10:30:16 INFO MemoryStore: ensureFreeSpace(3082) called with
> curMem=4472, maxMem=556038881
> 15/02/14 10:30:16 INFO MemoryStore: Block broadcast_0_piece0 stored as
> bytes in memory (estimated size 3.0 KB, free 530.3 MB)
> 15/02/14 10:30:16 INFO BlockManagerInfo: Added broadcast_0_piece0 in
> memory on x  (size: 3.0 KB, free: 530.3 MB)
> 15/02/14 10:30:16 INFO BlockManagerMaster: Updated info of block
> broadcast_0_piece0
> 15/02/14 10:30:16 INFO SparkContext: Created broadcast 0 from broadcast at
> DAGScheduler.scala:838
> 15/02/14 10:30:16 INFO DAGScheduler: Submitting 3 missing tasks from Stage
> 0 (CassandraRDD[0] at RDD at CassandraRDD.scala:49)
> 15/02/14 10:30:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
> 15/02/14 10:30:31 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
>
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: hive-thriftserver maven artifact

2015-02-16 Thread Arush Kharbanda
You can build your own Spark with the -Phive-thriftserver option.

You can publish the jars locally. I hope that would solve your problem.
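
For example, something along these lines (a sketch; add whatever Hadoop/YARN
profiles your setup needs). The install goal publishes the artifacts to your
local Maven repository:

mvn -Phive -Phive-thriftserver -DskipTests clean install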

On Mon, Feb 16, 2015 at 8:54 PM, Marco  wrote:

> Ok, so will it be only available for the next version (1.30)?
>
> 2015-02-16 15:24 GMT+01:00 Ted Yu :
>
>> I searched for 'spark-hive-thriftserver_2.10' on this page:
>> http://mvnrepository.com/artifact/org.apache.spark
>>
>> Looks like it is not published.
>>
>> On Mon, Feb 16, 2015 at 5:44 AM, Marco  wrote:
>>
>>> Hi,
>>>
>>> I am referring to https://issues.apache.org/jira/browse/SPARK-4925
>>> (Hive Thriftserver Maven Artifact). Can somebody point me (URL) to the
>>> artifact in a public repository ? I have not found it @Maven Central.
>>>
>>> Thanks,
>>> Marco
>>>
>>>
>>
>
>
> --
> Viele Grüße,
> Marco
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Question about spark streaming+Flume

2015-02-16 Thread Arush Kharbanda
Hi

Can you try this

val lines = FlumeUtils.createStream(ssc,"localhost",)
// Print out the count of events received from this server in each
batch
   lines.count().map(cnt => "Received " + cnt + " flume events. at " +
System.currentTimeMillis() )

lines.foreachRDD(_.foreach(println))

Thanks
Arush

On Tue, Feb 17, 2015 at 11:17 AM, bit1...@163.com  wrote:

> Hi,
> I am trying Spark Streaming + Flume example:
>
> 1. Code
> object SparkFlumeNGExample {
>def main(args : Array[String]) {
>val conf = new SparkConf().setAppName("SparkFlumeNGExample")
>val ssc = new StreamingContext(conf, Seconds(10))
>
>val lines = FlumeUtils.createStream(ssc,"localhost",)
> // Print out the count of events received from this server in each
> batch
>lines.count().map(cnt => "Received " + cnt + " flume events. at " +
> System.currentTimeMillis() ).print()
>ssc.start()
>ssc.awaitTermination();
> }
> }
> 2. I submit the application with following sh:
> ./spark-submit --deploy-mode client --name SparkFlumeEventCount --master
> spark://hadoop.master:7077 --executor-memory 512M --total-executor-cores 2
> --class spark.examples.streaming.SparkFlumeNGWordCount
> spark-streaming-flume.jar
>
>
> When I write data to flume, I only notice the following console
> information that input is added.
> storage.BlockManagerInfo: Added input-0-1424151807400 in memory on
> localhost:39338 (size: 1095.0 B, free: 267.2 MB)
> 15/02/17 00:43:30 INFO scheduler.JobScheduler: Added jobs for time
> 142415181 ms
> 15/02/17 00:43:40 INFO scheduler.JobScheduler: Added jobs for time
> 142415182 ms
> 
> 15/02/17 00:43:40 INFO scheduler.JobScheduler: Added jobs for time
> 142415187 ms
>
> But I didn't the output from the code: "Received X flumes events"
>
> I am no idea where the problem is, any idea? Thanks
>
>
> --
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Concurrent batch processing

2015-02-12 Thread Arush Kharbanda
It could depend on the nature of your application, but Spark Streaming uses
Spark internally and concurrency should be there. What is your use case?


Are you sure that your configuration is good?


On Fri, Feb 13, 2015 at 1:17 AM, Matus Faro  wrote:

> Hi,
>
> Please correct me if I'm wrong, in Spark Streaming, next batch will
> not start processing until the previous batch has completed. Is there
> any way to be able to start processing the next batch if the previous
> batch is taking longer to process than the batch interval?
>
> The problem I am facing is that I don't see a hardware bottleneck in
> my Spark cluster, but Spark is not able to handle the amount of data I
> am pumping through (batch processing time is longer than batch
> interval). What I'm seeing is spikes of CPU, network and disk IO usage
> which I assume are due to different stages of a job, but on average,
> the hardware is under utilized. Concurrency in batch processing would
> allow the average batch processing time to be greater than batch
> interval while fully utilizing the hardware.
>
> Any ideas on what can be done? One option I can think of is to split
> the application into multiple applications running concurrently and
> dividing the initial stream of data between those applications.
> However, I would have to lose the benefits of having a single
> application.
>
> Thank you,
> Matus
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Master dies after program finishes normally

2015-02-12 Thread Arush Kharbanda
How many nodes do you have in your cluster, how many cores, what is the
size of the memory?

On Fri, Feb 13, 2015 at 12:42 AM, Manas Kar 
wrote:

> Hi Arush,
>  Mine is a CDH5.3 with Spark 1.2.
> The only change to my spark programs are
> -Dspark.driver.maxResultSize=3g -Dspark.akka.frameSize=1000.
>
> ..Manas
>
> On Thu, Feb 12, 2015 at 2:05 PM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
>> What is your cluster configuration? Did you try looking at the Web UI?
>> There are many tips here
>>
>> http://spark.apache.org/docs/1.2.0/tuning.html
>>
>> Did you try these?
>>
>> On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar 
>> wrote:
>>
>>> Hi,
>>>  I have a Hidden Markov Model running with 200MB data.
>>>  Once the program finishes (i.e. all stages/jobs are done) the program
>>> hangs for 20 minutes or so before killing master.
>>>
>>> In the spark master the following log appears.
>>>
>>> 2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught fatal
>>> error from thread [sparkMaster-akka.actor.default-dispatcher-31] shutting
>>> down ActorSystem [sparkMaster]
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> at scala.collection.immutable.List$.newBuilder(List.scala:396)
>>> at
>>> scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:69)
>>> at
>>> scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:105)
>>> at
>>> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:58)
>>> at
>>> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:53)
>>> at
>>> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
>>> at
>>> scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
>>> at
>>> scala.collection.AbstractTraversable.map(Traversable.scala:105)
>>> at
>>> org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:26)
>>> at
>>> org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:22)
>>> at
>>> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>>> at
>>> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>>> at scala.collection.immutable.List.foreach(List.scala:318)
>>> at
>>> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>>> at
>>> scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>>> at org.json4s.MonadicJValue.org
>>> $json4s$MonadicJValue$$findDirectByName(MonadicJValue.scala:22)
>>> at org.json4s.MonadicJValue.$bslash(MonadicJValue.scala:16)
>>> at
>>> org.apache.spark.util.JsonProtocol$.taskStartFromJson(JsonProtocol.scala:450)
>>> at
>>> org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:423)
>>> at
>>> org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71)
>>> at
>>> org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69)
>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> at
>>> org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
>>> at
>>> org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
>>> at
>>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>> at
>>> scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>>> at
>>> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
>>> at
>>> org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:726)
>>> at
>>> org.apache.spark.deploy.master.Master.removeApplication(Master.scala:675)
>>> at
>>> org.apache.spark.deploy.master.Master.finishApplication(Master.scala:653)
>>> at
>>> org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$29.apply(Master.scala:399)
>>>
>>> Can anyone help?
>>>
>>> ..Manas
>>>
>>
>>
>>
>> --
>>
>> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>>
>> *Arush Kharbanda* || Technical Teamlead
>>
>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>>
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Master dies after program finishes normally

2015-02-12 Thread Arush Kharbanda
What is your cluster configuration? Did you try looking at the Web UI?
There are many tips here

http://spark.apache.org/docs/1.2.0/tuning.html

Did you try these?

On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar 
wrote:

> Hi,
>  I have a Hidden Markov Model running with 200MB data.
>  Once the program finishes (i.e. all stages/jobs are done) the program
> hangs for 20 minutes or so before killing master.
>
> In the spark master the following log appears.
>
> 2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught fatal
> error from thread [sparkMaster-akka.actor.default-dispatcher-31] shutting
> down ActorSystem [sparkMaster]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at scala.collection.immutable.List$.newBuilder(List.scala:396)
> at
> scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:69)
> at
> scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:105)
> at
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:58)
> at
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:53)
> at
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:26)
> at
> org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:22)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at
> scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at org.json4s.MonadicJValue.org
> $json4s$MonadicJValue$$findDirectByName(MonadicJValue.scala:22)
> at org.json4s.MonadicJValue.$bslash(MonadicJValue.scala:16)
> at
> org.apache.spark.util.JsonProtocol$.taskStartFromJson(JsonProtocol.scala:450)
> at
> org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:423)
> at
> org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71)
> at
> org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
> at
> org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at
> scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
> at
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
> at
> org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:726)
> at
> org.apache.spark.deploy.master.Master.removeApplication(Master.scala:675)
> at
> org.apache.spark.deploy.master.Master.finishApplication(Master.scala:653)
> at
> org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$29.apply(Master.scala:399)
>
> Can anyone help?
>
> ..Manas
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: 8080 port password protection

2015-02-12 Thread Arush Kharbanda
You could apply a password using a servlet filter on the server. Though this
doesn't look like the right group for the question, it can be done for Spark
as well, for the Spark web UI.
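
A sketch of how a filter can be wired into the Spark web UI via spark.ui.filters,
assuming a hypothetical filter class com.example.BasicAuthFilter that implements
javax.servlet.Filter (the property names are Spark's; the class and its
parameters are illustrative):

# in conf/spark-defaults.conf
spark.ui.filters  com.example.BasicAuthFilter
spark.com.example.BasicAuthFilter.params  user=admin,password=changeme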

On Thu, Feb 12, 2015 at 10:19 PM, MASTER_ZION (Jairo Linux) <
master.z...@gmail.com> wrote:

> Hi everyone,
>
> Im creating a development machine in AWS and i would like to protect the
> port 8080 using a password.
>
> Is it possible?
>
>
> Best Regards
>
> *Jairo Moreno*
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Spark Streaming distributed batch locking

2015-02-12 Thread Arush Kharbanda
* We have an inbound stream of sensor data for millions of devices (which
have unique identifiers). Spark Streaming can handle events in the ballpark
of 100-500K records/sec/node - *so you need to decide on a cluster
accordingly. And it's scalable.*

* We need to perform aggregation of this stream on a per device level.
The aggregation will read data that has already been processed (and
persisted) in previous batches. - *You need to do stateful stream
processing; Spark Streaming allows you to do that, check out **updateStateByKey**
(see the sketch after this list) -
http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html
<http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html>*

* Key point:  When we process data for a particular device we need to
ensure that no other processes are processing data for that particular
device.  This is because the outcome of our processing will affect the
downstream processing for that device.  Effectively we need a distributed
lock. - *You can make the source device the key and then use updateStateByKey
in Spark on that key.*

* In addition the event device data needs to be processed in the order
that the events occurred. - *You would need to implement this in your code,
adding a timestamp as a data item. Spark Streaming doesn't ensure in-order
delivery of your events.*
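
A minimal sketch of the per-device stateful aggregation, assuming events
arrive as a DStream of (deviceId, reading) pairs and that a running sum
stands in for the real aggregation logic:

import org.apache.spark.streaming.StreamingContext._  // pair-DStream ops in 1.2

// Assumes `events: DStream[(String, Double)]` and that ssc.checkpoint(...)
// has been set, which updateStateByKey requires. All new values for one
// device key are handed to a single update call per batch, so two tasks
// never update the same device's state at the same time.
val deviceTotals = events.updateStateByKey[Double] {
  (newReadings: Seq[Double], runningTotal: Option[Double]) =>
    Some(runningTotal.getOrElse(0.0) + newReadings.sum)
}
deviceTotals.print()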

On Thu, Feb 12, 2015 at 4:51 PM, Legg John  wrote:

> Hi
>
> After doing lots of reading and building a POC for our use case we are
> still unsure as to whether Spark Streaming can handle our use case:
>
> * We have an inbound stream of sensor data for millions of devices (which
> have unique identifiers).
> * We need to perform aggregation of this stream on a per device level.
> The aggregation will read data that has already been processed (and
> persisted) in previous batches.
> * Key point:  When we process data for a particular device we need to
> ensure that no other processes are processing data for that particular
> device.  This is because the outcome of our processing will affect the
> downstream processing for that device.  Effectively we need a distributed
> lock.
> * In addition the event device data needs to be processed in the order
> that the events occurred.
>
> Essentially we can¹t have two batches for the same device being processed
> at the same time.
>
> Can Spark handle our use case?
>
> Any advice appreciated.
>
> Regards
> John
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Question related to Spark SQL

2015-02-11 Thread Arush Kharbanda
I am implementing this approach currently.

1. Create data tables in spark-sql and cache them.
2. Configure the Hive metastore to read the cached tables and share the
same metastore as spark-sql (you get the Spark caching advantage).
3. Run Spark code to fetch from the cached tables. In the Spark code you can
generate queries at runtime (see the sketch below).
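
A minimal sketch of steps 1 and 3, assuming a Hive-enabled build and an
existing table called events; the table name and the filter value are
illustrative, and the WHERE clause is what would be built from UI selections:

import org.apache.spark.sql.hive.HiveContext

// sc: an existing SparkContext (e.g. from spark-shell)
val hiveContext = new HiveContext(sc)

// Step 1: cache the table so repeated ad-hoc queries hit memory.
hiveContext.sql("CACHE TABLE events")

// Step 3: build the query string at runtime from the UI selections.
val country = "US"  // illustrative value coming from a UI filter
val report = hiveContext.sql(
  s"SELECT region, COUNT(*) AS cnt FROM events WHERE country = '$country' GROUP BY region")
report.collect().foreach(println)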


On Wed, Feb 11, 2015 at 4:12 PM, Ashish Mukherjee <
ashish.mukher...@gmail.com> wrote:

> Hi,
>
> I am planning to use Spark for a Web-based adhoc reporting tool on massive
> date-sets on S3. Real-time queries with filters, aggregations and joins
> could be constructed from UI selections.
>
> Online documentation seems to suggest that SharkQL is deprecated and users
> should move away from it.  I understand Hive is generally not used for
> real-time querying and for Spark SQL to work with other data stores, tables
> need to be registered explicitly in code. Also, the This would not be
> suitable for a dynamic query construction scenario.
>
> For a real-time , dynamic querying scenario like mine what is the proper
> tool to be used with Spark SQL?
>
> Regards,
> Ashish
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: SparkSQL + Tableau Connector

2015-02-11 Thread Arush Kharbanda
Hi

I used this, though it uses an embedded driver and is not a good
approach. It works. You can configure some other metastore type as well. I
have not tried the metastore URIs.






<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
    <description>URL for the DB</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
</configuration>








On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist  wrote:

> Hi Arush,
>
> So yes I want to create the tables through Spark SQL.  I have placed the
> hive-site.xml file inside of the $SPARK_HOME/conf directory I thought that
> was all I should need to do to have the thriftserver use it.  Perhaps my
> hive-site.xml is worng, it currently looks like this:
>
> <configuration>
>   <property>
>     <name>hive.metastore.uris</name>
>     <value>thrift://sandbox.hortonworks.com:9083</value>
>     <description>URI for client to contact metastore server</description>
>   </property>
> </configuration>
>
> Which leads me to believe it is going to pull form the thriftserver from
> Horton?  I will go look at the docs to see if this is right, it is what
> Horton says to do.  Do you have an example hive-site.xml by chance that
> works with Spark SQL?
>
> I am using 8.3 of tableau with the SparkSQL Connector.
>
> Thanks for the assistance.
>
> -Todd
>
> On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
>> BTW what tableau connector are you using?
>>
>> On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda <
>> ar...@sigmoidanalytics.com> wrote:
>>
>>>  I am a little confused here, why do you want to create the tables in
>>> hive. You want to create the tables in spark-sql, right?
>>>
>>> If you are not able to find the same tables through tableau then thrift
>>> is connecting to a diffrent metastore than your spark-shell.
>>>
>>> One way to specify a metstore to thrift is to provide the path to
>>> hive-site.xml while starting thrift using --files hive-site.xml.
>>>
>>> similarly you can specify the same metastore to your spark-submit or
>>> sharp-shell using the same option.
>>>
>>>
>>>
>>> On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist  wrote:
>>>
>>>> Arush,
>>>>
>>>> As for #2 do you mean something like this from the docs:
>>>>
>>>> // sc is an existing SparkContext.
>>>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>>>> sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>>>> // Queries are expressed in HiveQL
>>>> sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
>>>>
>>>> Or did you have something else in mind?
>>>>
>>>> -Todd
>>>>
>>>>
>>>> On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist  wrote:
>>>>
>>>>> Arush,
>>>>>
>>>>> Thank you will take a look at that approach in the morning.  I sort of
>>>>> figured the answer to #1 was NO and that I would need to do 2 and 3 thanks
>>>>> for clarifying it for me.
>>>>>
>>>>> -Todd
>>>>>
>>>>> On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda <
>>>>> ar...@sigmoidanalytics.com> wrote:
>>>>>
>>>>>> 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
>>>>>> JSON files? NO
>>>>>> 2.  Do I need to do something to expose these via hive / metastore
>>>>>> other than creating a table in hive? Create a table in spark sql to 
>>>>>> expose
>>>>>> via spark sql
>>>>>> 3.  Does the thriftserver need to be configured to expose these in
>>>>>> some fashion, sort of related to question 2 you would need to configure
>>>>>> thrift to read from the metastore you expect it read from - by default it
>>>>>> reads from metastore_db directory present in the directory used to launch
>>>>>> the thrift server.
>>>>>>  On 11 Feb 2015 01:35, "Todd Nist"  wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to understand how and what the Tableau connector to
>>>>>>> SparkSQL is able to access.  My understanding is it needs to connect to 
>>>>>>> the

Re: using spark in web services

2015-02-11 Thread Arush Kharbanda
Hi

Are you able to run the code after eliminating all the Spark code? That would
help find out whether the issue is with Jetty or with Spark itself; it could
also be due to conflicting Jetty versions between Spark and the one you are
trying to use.

You can check the dependency graph in Maven to see whether there are any
dependency conflicts:

mvn dependency:tree -Dverbose


Thanks
Arush

On Mon, Feb 9, 2015 at 4:00 PM, Hafiz Mujadid 
wrote:

> Hi experts! I am trying to use spark in my restful webservices.I am using
> scala lift frramework for writing web services. Here is my boot class
> class Boot extends Bootable {
>   def boot {
> Constants.loadConfiguration
> val sc=new SparkContext(new
> SparkConf().setMaster("local").setAppName("services"))
> // Binding Service as a Restful API
> LiftRules.statelessDispatchTable.append(RestfulService);
> // resolve the trailing slash issue
> LiftRules.statelessRewrite.prepend({
>   case RewriteRequest(ParsePath(path, _, _, true), _, _) if path.last
> ==
> "index" => RewriteResponse(path.init)
> })
>
>   }
> }
>
>
> When i remove this line val sc=new SparkContext(new
> SparkConf().setMaster("local").setAppName("services"))
>
> then it works fine.
> I am starting services using command
>
> java -jar start.jar jetty.port=
>
> and get following exception
>
>
> ERROR net.liftweb.http.provider.HTTPProvider - Failed to Boot! Your
> application may not run properly
> java.lang.NoClassDefFoundError:
> org/eclipse/jetty/server/handler/ContextHandler$NoContext
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.newServletHandler(ServletContextHandler.java:260)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.getServletHandler(ServletContextHandler.java:322)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.relinkHandlers(ServletContextHandler.java:198)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:157)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:135)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:129)
> at
>
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:99)
> at
> org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:96)
> at
> org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:87)
> at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:67)
> at
> org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
> at
> org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
> at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:60)
> at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:50)
> at org.apache.spark.ui.SparkUI.(SparkUI.scala:63)
>
>
>
> Any suggestion please?
>
> Am I using right command to run this ?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-in-web-services-tp21550.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Lost task - connection closed

2015-02-11 Thread Arush Kharbanda
elds(ObjectOutputStream.java:1547)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>> at
>>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>> at
>>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>> at
>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>> at
>>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> at
>>>
>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>>
>>> If you have any pointers for me on how to debug this, that would be very
>>> useful. I tried running with both spark 1.2.0 and 1.1.1, getting the same
>>> error.
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Lost-task-connection-closed-tp21361p21371.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: SparkSQL + Tableau Connector

2015-02-10 Thread Arush Kharbanda
BTW what tableau connector are you using?

On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda <
ar...@sigmoidanalytics.com> wrote:

>  I am a little confused here, why do you want to create the tables in
> hive. You want to create the tables in spark-sql, right?
>
> If you are not able to find the same tables through tableau then thrift is
> connecting to a diffrent metastore than your spark-shell.
>
> One way to specify a metstore to thrift is to provide the path to
> hive-site.xml while starting thrift using --files hive-site.xml.
>
> similarly you can specify the same metastore to your spark-submit or
> sharp-shell using the same option.
>
>
>
> On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist  wrote:
>
>> Arush,
>>
>> As for #2 do you mean something like this from the docs:
>>
>> // sc is an existing SparkContext.
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>> sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>> // Queries are expressed in HiveQL
>> sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
>>
>> Or did you have something else in mind?
>>
>> -Todd
>>
>>
>> On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist  wrote:
>>
>>> Arush,
>>>
>>> Thank you will take a look at that approach in the morning.  I sort of
>>> figured the answer to #1 was NO and that I would need to do 2 and 3 thanks
>>> for clarifying it for me.
>>>
>>> -Todd
>>>
>>> On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda <
>>> ar...@sigmoidanalytics.com> wrote:
>>>
>>>> 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
>>>> JSON files? NO
>>>> 2.  Do I need to do something to expose these via hive / metastore
>>>> other than creating a table in hive? Create a table in spark sql to expose
>>>> via spark sql
>>>> 3.  Does the thriftserver need to be configured to expose these in some
>>>> fashion, sort of related to question 2 you would need to configure thrift
>>>> to read from the metastore you expect it read from - by default it reads
>>>> from metastore_db directory present in the directory used to launch the
>>>> thrift server.
>>>>  On 11 Feb 2015 01:35, "Todd Nist"  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to understand how and what the Tableau connector to
>>>>> SparkSQL is able to access.  My understanding is it needs to connect to 
>>>>> the
>>>>> thriftserver and I am not sure how or if it exposes parquet, json,
>>>>> schemaRDDs, or does it only expose schemas defined in the metastore / 
>>>>> hive.
>>>>>
>>>>>
>>>>> For example, I do the following from the spark-shell which generates a
>>>>> schemaRDD from a csv file and saves it as a JSON file as well as a parquet
>>>>> file.
>>>>>
>>>>> import org.apache.sql.SQLContext
>>>>> import com.databricks.spark.csv._
>>>>> val sqlContext = new SQLContext(sc)
>>>>> val test = sqlContext.csfFile("/data/test.csv")
>>>>> test.toJSON().saveAsTextFile("/data/out")
>>>>> test.saveAsParquetFile("/data/out")
>>>>>
>>>>> When I connect from Tableau, the only thing I see is the "default"
>>>>> schema and nothing in the tables section.
>>>>>
>>>>> So my questions are:
>>>>>
>>>>> 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
>>>>> JSON files?
>>>>> 2.  Do I need to do something to expose these via hive / metastore
>>>>> other than creating a table in hive?
>>>>> 3.  Does the thriftserver need to be configured to expose these in
>>>>> some fashion, sort of related to question 2.
>>>>>
>>>>> TIA for the assistance.
>>>>>
>>>>> -Todd
>>>>>
>>>>
>>>
>>
>
>
> --
>
> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Error while querying hive table from spark shell

2015-02-10 Thread Arush Kharbanda
thodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.RuntimeException: Unable to instantiate
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient
> at
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:62)
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
> at
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
> at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976)
> ... 88 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
> ... 93 more
> Caused by: java.lang.NoClassDefFoundError:
> org/datanucleus/exceptions/NucleusException
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:274)
> at javax.jdo.JDOHelper$18.run(JDOHelper.java:2018)
> at javax.jdo.JDOHelper$18.run(JDOHelper.java:2016)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
> at
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
> at
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
> at
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
> at
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
> at
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
> at
> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:58)
> at
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:497)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:475)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:523)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:397)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.(HiveMetaStore.java:356)
> at
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:54)
> at
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4944)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:171)
> ... 98 more
> Caused by: java.lang.ClassNotFoundException:
> org.datanucleus.exceptions.NucleusException
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 124 more
>
>
>
> Regards,
> Kundan
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: SparkSQL + Tableau Connector

2015-02-10 Thread Arush Kharbanda
I am a little confused here: why do you want to create the tables in Hive?
You want to create the tables in spark-sql, right?

If you are not able to find the same tables through Tableau, then thrift is
connecting to a different metastore than your spark-shell.

One way to specify a metastore to thrift is to provide the path to
hive-site.xml while starting thrift, using --files hive-site.xml.

Similarly, you can specify the same metastore to your spark-submit or
spark-shell using the same option (see the example below).
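
For example (a sketch; the master URL and paths are illustrative):

./sbin/start-thriftserver.sh --master spark://master:7077 --files /path/to/hive-site.xml
./bin/spark-shell --master spark://master:7077 --files /path/to/hive-site.xml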



On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist  wrote:

> Arush,
>
> As for #2 do you mean something like this from the docs:
>
> // sc is an existing SparkContext.
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
> // Queries are expressed in HiveQL
> sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
>
> Or did you have something else in mind?
>
> -Todd
>
>
> On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist  wrote:
>
>> Arush,
>>
>> Thank you will take a look at that approach in the morning.  I sort of
>> figured the answer to #1 was NO and that I would need to do 2 and 3 thanks
>> for clarifying it for me.
>>
>> -Todd
>>
>> On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda <
>> ar...@sigmoidanalytics.com> wrote:
>>
>>> 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
>>> JSON files? No.
>>> 2.  Do I need to do something to expose these via hive / metastore other
>>> than creating a table in hive? Create a table in Spark SQL to expose them
>>> via Spark SQL.
>>> 3.  Does the thriftserver need to be configured to expose these in some
>>> fashion, sort of related to question 2? You would need to configure the
>>> Thrift server to read from the metastore you expect it to read from - by
>>> default it reads from the metastore_db directory present in the directory
>>> used to launch the Thrift server.
>>>  On 11 Feb 2015 01:35, "Todd Nist"  wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to understand how and what the Tableau connector to SparkSQL
>>>> is able to access.  My understanding is it needs to connect to the
>>>> thriftserver and I am not sure how or if it exposes parquet, json,
>>>> schemaRDDs, or does it only expose schemas defined in the metastore / hive.
>>>>
>>>>
>>>> For example, I do the following from the spark-shell which generates a
>>>> schemaRDD from a csv file and saves it as a JSON file as well as a parquet
>>>> file.
>>>>
>>>> import org.apache.spark.sql.SQLContext
>>>> import com.databricks.spark.csv._
>>>> val sqlContext = new SQLContext(sc)
>>>> val test = sqlContext.csvFile("/data/test.csv")
>>>> test.toJSON.saveAsTextFile("/data/out.json")
>>>> test.saveAsParquetFile("/data/out.parquet")
>>>>
>>>> When I connect from Tableau, the only thing I see is the "default"
>>>> schema and nothing in the tables section.
>>>>
>>>> So my questions are:
>>>>
>>>> 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
>>>> JSON files?
>>>> 2.  Do I need to do something to expose these via hive / metastore
>>>> other than creating a table in hive?
>>>> 3.  Does the thriftserver need to be configured to expose these in some
>>>> fashion, sort of related to question 2.
>>>>
>>>> TIA for the assistance.
>>>>
>>>> -Todd
>>>>
>>>
>>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: SparkSQL + Tableau Connector

2015-02-10 Thread Arush Kharbanda
1.  Can the connector fetch or query schemaRDD's saved to Parquet or JSON
files? No.
2.  Do I need to do something to expose these via hive / metastore other
than creating a table in hive? Create a table in Spark SQL to expose them
via Spark SQL.
3.  Does the thriftserver need to be configured to expose these in some
fashion, sort of related to question 2? You would need to configure the
Thrift server to read from the metastore you expect it to read from - by
default it reads from the metastore_db directory present in the directory
used to launch the Thrift server.
 On 11 Feb 2015 01:35, "Todd Nist"  wrote:

> Hi,
>
> I'm trying to understand how and what the Tableau connector to SparkSQL is
> able to access.  My understanding is it needs to connect to the
> thriftserver and I am not sure how or if it exposes parquet, json,
> schemaRDDs, or does it only expose schemas defined in the metastore / hive.
>
>
> For example, I do the following from the spark-shell which generates a
> schemaRDD from a csv file and saves it as a JSON file as well as a parquet
> file.
>
> import org.apache.spark.sql.SQLContext
> import com.databricks.spark.csv._
> val sqlContext = new SQLContext(sc)
> val test = sqlContext.csvFile("/data/test.csv")
> test.toJSON.saveAsTextFile("/data/out.json")
> test.saveAsParquetFile("/data/out.parquet")
>
> When I connect from Tableau, the only thing I see is the "default" schema
> and nothing in the tables section.
>
> So my questions are:
>
> 1.  Can the connector fetch or query schemaRDD's saved to Parquet or JSON
> files?
> 2.  Do I need to do something to expose these via hive / metastore other
> than creating a table in hive?
> 3.  Does the thriftserver need to be configured to expose these in some
> fashion, sort of related to question 2.
>
> TIA for the assistance.
>
> -Todd
>


Re: ZeroMQ and pyspark.streaming

2015-02-10 Thread Arush Kharbanda
No, the ZeroMQ API is not supported in Python as of now.
On 5 Feb 2015 21:27, "Sasha Kacanski"  wrote:

> Does pyspark supports zeroMQ?
> I see that java does it, but I am not sure for Python?
> regards
>
> --
> Aleksandar Kacanski
>


Re: Spark 1.2.x Yarn Auxiliary Shuffle Service

2015-02-09 Thread Arush Kharbanda
Is this what you are looking for?


   1. Build Spark with the YARN profile
   <http://spark.apache.org/docs/1.2.0/building-spark.html>. Skip this step
   if you are using a pre-packaged distribution.
   2. Locate the spark-<version>-yarn-shuffle.jar. This should be under
   $SPARK_HOME/network/yarn/target/scala-<version> if you are building
   Spark yourself, and under lib if you are using a distribution.
   3. Add this jar to the classpath of all NodeManagers in your cluster.
   4. In the yarn-site.xml on each node, add spark_shuffle to
   yarn.nodemanager.aux-services, then set
   yarn.nodemanager.aux-services.spark_shuffle.class to
   org.apache.spark.network.yarn.YarnShuffleService. Additionally, set all
   relevant spark.shuffle.service.* configurations
   <http://spark.apache.org/docs/1.2.0/configuration.html>; a sketch of the
   application-side setting follows this list.
   5. Restart all NodeManagers in your cluster.

On Wed, Jan 28, 2015 at 1:30 AM, Corey Nolet  wrote:

> I've read that this is supposed to be a rather significant optimization to
> the shuffle system in 1.1.0 but I'm not seeing much documentation on
> enabling this in Yarn. I see github classes for it in 1.2.0 and a property
> "spark.shuffle.service.enabled" in the spark-defaults.conf.
>
> The code mentions that this is supposed to be run inside the Nodemanager
> so I'm assuming it needs to be wired up in the yarn-site.xml under the
> "yarn.nodemanager.aux-services" property?
>
>
>
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: checking

2015-02-06 Thread Arush Kharbanda
Yes, they are.

On Fri, Feb 6, 2015 at 5:06 PM, Mohit Durgapal 
wrote:

> Just wanted to know If my emails are reaching the user list.
>
>
> Regards
> Mohit
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Spark job ends abruptly during setup without error message

2015-02-05 Thread Arush Kharbanda
Are you submitting the job from your local machine or from the driver
machine?

Have you set YARN_CONF_DIR?

On Thu, Feb 5, 2015 at 10:43 PM, Arun Luthra  wrote:

> While a spark-submit job is setting up, the yarnAppState goes into Running
> mode, then I get a flurry of typical looking INFO-level messages such as
>
> INFO BlockManagerMasterActor: ...
> INFO YarnClientSchedulerBackend: Registered executor:  ...
>
> Then, spark-submit quits without any error message and I'm back at the
> command line. What could be causing this?
>
> Arun
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Arush Kharbanda
You can use Akka; that is the underlying multi-threading library Spark uses.
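
For the local-mode use case in this thread, a minimal sketch of running Spark as
an in-process multi-threading library with the web UI disabled (the app name and
the example computation are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// local[*] runs one task thread per available core, and spark.ui.enabled=false
// keeps the web UI (default port 4040) from starting at all.
val conf = new SparkConf()
  .setAppName("local-multithreading")   // placeholder name
  .setMaster("local[*]")
  .set("spark.ui.enabled", "false")
val sc = new SparkContext(conf)

val result = sc.parallelize(1 to 1000).map(_ * 2).sum()
sc.stop()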

On Thu, Feb 5, 2015 at 9:56 PM, Shuai Zheng  wrote:

> Nice. I just try and it works. Thanks very much!
>
> And I notice there is below in the log:
>
> 15/02/05 11:19:09 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriver@NY02913D.global.local:8162]
> 15/02/05 11:19:10 INFO AkkaUtils: Connecting to HeartbeatReceiver:
> akka.tcp://sparkDriver@NY02913D.global.local:8162/user/HeartbeatReceiver
>
> As I understand. The local mode will have driver and executors in the same
> java process. So is there any way for me to also disable above two
> listeners? Or they are not optional even in local mode?
>
> Regards,
>
> Shuai
>
>
>
> -Original Message-
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Thursday, February 05, 2015 10:53 AM
> To: Shuai Zheng
> Cc: user@spark.apache.org
> Subject: Re: Use Spark as multi-threading library and deprecate web UI
>
> Do you mean disable the web UI? spark.ui.enabled=false
>
> Sure, it's useful with master = local[*] too.
>
> On Thu, Feb 5, 2015 at 9:30 AM, Shuai Zheng  wrote:
> > Hi All,
> >
> >
> >
> > It might sounds weird, but I think spark is perfect to be used as a
> > multi-threading library in some cases. The local mode will naturally
> > boost multiple thread when required. Because it is more restrict and
> > less chance to have potential bug in the code (because it is more data
> > oriental, not thread oriental). Of course, it cannot be used for all
> > cases, but in most of my applications, it is enough (90%).
> >
> >
> >
> > I want to hear other people’s idea about this.
> >
> >
> >
> > BTW: if I run spark in local mode, how to deprecate the web UI
> > (default listen on 4040), because I don’t want to start the UI every
> > time if I use spark as a local library.
> >
> >
> >
> > Regards,
> >
> >
> >
> > Shuai
>
>
> ---------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: how to specify hive connection options for HiveContext

2015-02-05 Thread Arush Kharbanda
Hi,

Are you trying to run a Spark job from inside Eclipse, and do you want the
job to access Hive configuration options, i.e. to access Hive tables?
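
If that is the case, a minimal sketch of pointing a HiveContext at an existing
metastore from code run inside an IDE (the metastore URI is a placeholder; the
alternative is simply putting hive-site.xml on the application classpath):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-from-ide").setMaster("local[*]"))
val hiveContext = new HiveContext(sc)

// Point at an existing metastore instead of the default local metastore_db.
hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")  // placeholder URI
hiveContext.sql("SHOW TABLES").collect().foreach(println)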

Thanks
Arush

On Tue, Feb 3, 2015 at 7:24 AM, guxiaobo1982  wrote:

> Hi,
>
> I know two options, one for spark_submit, the other one for spark-shell,
> but how to set for programs running inside eclipse?
>
> Regards,
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Errors in the workers machines

2015-02-05 Thread Arush Kharbanda
1. For what reasons is Spark using the above ports? What internal component
is triggering them? - Akka (guessing from the error log) is used to schedule
tasks and to notify executors; the ports used are random by default.
2. How can I get rid of these errors? - Probably the ports are not open on
your server. You can pin the ports and open them using spark.driver.port
and spark.executor.port (a sketch follows this list), or you can open all
ports between the masters and slaves.
For a cluster on EC2, the ec2 script takes care of the required setup.
3. Why is the application still finished with success? - Do you have more
workers in the cluster which are able to connect?
4. Why is it trying more ports? - Not sure; it is picking the ports
randomly.
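
A minimal sketch of pinning those two ports in the application configuration
(the port numbers are placeholders; note that spark.executor.port applies to
Spark 1.x, where executors communicate over Akka):

import org.apache.spark.SparkConf

// Fix the driver and executor ports instead of letting them be chosen randomly,
// so only these specific ports need to be open between the machines.
val conf = new SparkConf()
  .set("spark.driver.port", "51000")     // placeholder port
  .set("spark.executor.port", "51100")   // placeholder port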

On Thu, Feb 5, 2015 at 2:30 PM, Spico Florin  wrote:

> Hello!
>  I received the following errors in the workerLog.log files:
>
> ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@stream4:33660]
> -> [akka.tcp://sparkExecutor@stream4:47929]: Error [Association failed
> with [akka.tcp://sparkExecutor@stream4:47929]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkExecutor@stream4:47929]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: stream4/x.x.x.x:47929
> ]
> (For security reason  have masked the IP with x.x.x.x). The same errors
> occurs for different ports
> (42395,39761).
> Even though I have these errors the application is finished with success.
> I have the following questions:
> 1. For what reasons is using Spark the above ports? What internal
> component is triggering them?
> 2. How I can get rid of these errors?
> 3. Why the application is still finished with success?
> 4. Why is trying with more ports?
>
> I look forward for your answers.
>   Regards.
>  Florin
>
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

Arush Kharbanda || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Union in Spark

2015-02-01 Thread Arush Kharbanda
Hi Deep,

What is your configuration and what is the size of the 2 data sets?

Thanks
Arush

On Mon, Feb 2, 2015 at 11:56 AM, Deep Pradhan 
wrote:

> I did not check the console because once the job starts I cannot run
> anything else and have to force shutdown the system. I commented parts of
> codes and I tested. I doubt it is because of union. So, I want to change it
> to something else and see if the problem persists.
>
> Thank you
>
> On Mon, Feb 2, 2015 at 11:53 AM, Jerry Lam  wrote:
>
>> Hi Deep,
>>
>> How do you know the cluster is not responsive because of "Union"?
>> Did you check the spark web console?
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan 
>> wrote:
>>
>>> The cluster hangs.
>>>
>>> On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam  wrote:
>>>
>>>> Hi Deep,
>>>>
>>>> what do you mean by stuck?
>>>>
>>>> Jerry
>>>>
>>>>
>>>> On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan <
>>>> pradhandeep1...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> Is there any better operation than Union. I am using union and the
>>>>> cluster is getting stuck with a large data set.
>>>>>
>>>>> Thank you
>>>>>
>>>>
>>>>
>>>
>>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Error when running spark in debug mode

2015-01-31 Thread Arush Kharbanda
Can you share your log4j properties file?

On Sat, Jan 31, 2015 at 1:35 PM, Arush Kharbanda  wrote:

> Hi Ankur,
>
> Its running fine for me for spark 1.1 and changes to log4j properties
> file.
>
> Thanks
> Arush
>
> On Fri, Jan 30, 2015 at 9:49 PM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> Hi Arush
>>
>> I have configured log4j by updating the file log4j.properties in
>> SPARK_HOME/conf folder.
>>
>> If it was a log4j defect we would get error in debug mode in all apps.
>>
>> Thanks
>> Ankur
>>  Hi Ankur,
>>
>> How are you enabling the debug level of logs. It should be a log4j
>> configuration. Even if there would be some issue it would be in log4j and
>> not in spark.
>>
>> Thanks
>> Arush
>>
>> On Fri, Jan 30, 2015 at 4:24 AM, Ankur Srivastava <
>> ankur.srivast...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> When ever I enable DEBUG level logs for my spark cluster, on running a
>>> job all the executors die with the below exception. On disabling the DEBUG
>>> logs my jobs move to the next step.
>>>
>>>
>>> I am on spark-1.1.0
>>>
>>> Is this a known issue with spark?
>>>
>>> Thanks
>>> Ankur
>>>
>>> 2015-01-29 22:27:42,467 [main] INFO  org.apache.spark.SecurityManager -
>>> SecurityManager: authentication disabled; ui acls disabled; users with view
>>> permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
>>>
>>> 2015-01-29 22:27:42,478 [main] DEBUG org.apache.spark.util.AkkaUtils -
>>> In createActorSystem, requireCookie is: off
>>>
>>> 2015-01-29 22:27:42,871
>>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO
>>> akka.event.slf4j.Slf4jLogger - Slf4jLogger started
>>>
>>> 2015-01-29 22:27:42,912
>>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
>>> Starting remoting
>>>
>>> 2015-01-29 22:27:43,057
>>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
>>> Remoting started; listening on addresses :[akka.tcp://
>>> driverPropsFetcher@10.77.9.155:36035]
>>>
>>> 2015-01-29 22:27:43,060
>>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
>>> Remoting now listens on addresses: [akka.tcp://
>>> driverPropsFetcher@10.77.9.155:36035]
>>>
>>> 2015-01-29 22:27:43,067 [main] INFO  org.apache.spark.util.Utils -
>>> Successfully started service 'driverPropsFetcher' on port 36035.
>>>
>>> 2015-01-29 22:28:13,077 [main] ERROR
>>> org.apache.hadoop.security.UserGroupInformation -
>>> PriviledgedActionException as:ubuntu
>>> cause:java.util.concurrent.TimeoutException: Futures timed out after [30
>>> seconds]
>>>
>>> Exception in thread "main"
>>> java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
>>>
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
>>>
>>> at
>>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
>>>
>>> at
>>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
>>>
>>> at
>>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
>>>
>>> at
>>> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>>>
>>> Caused by: java.security.PrivilegedActionException:
>>> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>>>
>>> at java.security.AccessController.doPrivileged(Native Method)
>>>
>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>>
>>> ... 4 more
>>>
>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out
>>> after [30 seconds]
>>>
>>> at
>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>>
>>> at
>>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>>
>>> at
>>> scala.concurrent.Await$$anonfun$result$1.apply(

Re: Spark streaming - tracking/deleting processed files

2015-01-31 Thread Arush Kharbanda
Hi Ganterm,

That is expected if you look at the documentation for textFileStream:

Create an input stream that monitors a Hadoop-compatible filesystem for new
files and reads them as text files (using key as LongWritable, value as
Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.

You need to move files into the directory while the job is up, and you need
to manage that with a shell script: move files in one at a time, and move
processed files out via another script.
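
For reference, a minimal sketch of the monitoring side (the directory path and
batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("file-stream-example").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(30))    // placeholder batch interval

// Only files moved into this directory *after* the job starts are picked up.
val lines = ssc.textFileStream("/data/incoming")     // placeholder directory
lines.foreachRDD(rdd => println("Got " + rdd.count() + " lines in this batch"))

ssc.start()
ssc.awaitTermination()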

On Fri, Jan 30, 2015 at 11:37 PM, ganterm  wrote:

> We are running a Spark streaming job that retrieves files from a directory
> (using textFileStream).
> One concern we are having is the case where the job is down but files are
> still being added to the directory.
> Once the job starts up again, those files are not being picked up (since
> they are not new or changed while the job is running) but we would like
> them
> to be processed.
> Is there a solution for that? Is there a way to keep track what files have
> been processed and can we "force" older files to be picked up? Is there a
> way to delete the processed files?
>
> Thanks!
> Markus
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-tracking-deleting-processed-files-tp21444.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Error when running spark in debug mode

2015-01-31 Thread Arush Kharbanda
Hi Ankur,

It's running fine for me on Spark 1.1 with changes to the log4j properties file.

Thanks
Arush

On Fri, Jan 30, 2015 at 9:49 PM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi Arush
>
> I have configured log4j by updating the file log4j.properties in
> SPARK_HOME/conf folder.
>
> If it was a log4j defect we would get error in debug mode in all apps.
>
> Thanks
> Ankur
>  Hi Ankur,
>
> How are you enabling the debug level of logs. It should be a log4j
> configuration. Even if there would be some issue it would be in log4j and
> not in spark.
>
> Thanks
> Arush
>
> On Fri, Jan 30, 2015 at 4:24 AM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> Hi,
>>
>> When ever I enable DEBUG level logs for my spark cluster, on running a
>> job all the executors die with the below exception. On disabling the DEBUG
>> logs my jobs move to the next step.
>>
>>
>> I am on spark-1.1.0
>>
>> Is this a known issue with spark?
>>
>> Thanks
>> Ankur
>>
>> 2015-01-29 22:27:42,467 [main] INFO  org.apache.spark.SecurityManager -
>> SecurityManager: authentication disabled; ui acls disabled; users with view
>> permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
>>
>> 2015-01-29 22:27:42,478 [main] DEBUG org.apache.spark.util.AkkaUtils - In
>> createActorSystem, requireCookie is: off
>>
>> 2015-01-29 22:27:42,871
>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO
>> akka.event.slf4j.Slf4jLogger - Slf4jLogger started
>>
>> 2015-01-29 22:27:42,912
>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
>> Starting remoting
>>
>> 2015-01-29 22:27:43,057
>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
>> Remoting started; listening on addresses :[akka.tcp://
>> driverPropsFetcher@10.77.9.155:36035]
>>
>> 2015-01-29 22:27:43,060
>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
>> Remoting now listens on addresses: [akka.tcp://
>> driverPropsFetcher@10.77.9.155:36035]
>>
>> 2015-01-29 22:27:43,067 [main] INFO  org.apache.spark.util.Utils -
>> Successfully started service 'driverPropsFetcher' on port 36035.
>>
>> 2015-01-29 22:28:13,077 [main] ERROR
>> org.apache.hadoop.security.UserGroupInformation -
>> PriviledgedActionException as:ubuntu
>> cause:java.util.concurrent.TimeoutException: Futures timed out after [30
>> seconds]
>>
>> Exception in thread "main"
>> java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
>>
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
>>
>> at
>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
>>
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
>>
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
>>
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>>
>> Caused by: java.security.PrivilegedActionException:
>> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>>
>> at java.security.AccessController.doPrivileged(Native Method)
>>
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>
>> ... 4 more
>>
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>> [30 seconds]
>>
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>
>> at
>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>
>> at
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>
>> at scala.concurrent.Await$.result(package.scala:107)
>>
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
>>
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
>>
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
>>
>> ... 7 more
>>
>
>
>
> --
>
> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: groupByKey is not working

2015-01-30 Thread Arush Kharbanda
Hi Amit,

What error does it throw?

Thanks
Arush

On Sat, Jan 31, 2015 at 1:50 AM, Amit Behera  wrote:

> hi all,
>
> my sbt file is like this:
>
> name := "Spark"
>
> version := "1.0"
>
> scalaVersion := "2.10.4"
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
>
> libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
>
>
> *code:*
>
> object SparkJob
> {
>
>   def pLines(lines:Iterator[String])={
> val parser=new CSVParser()
> lines.map(l=>{val vs=parser.parseLine(l)
>   (vs(0),vs(1).toInt)})
>   }
>
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Spark Job").setMaster("local")
> val sc = new SparkContext(conf)
> val data = sc.textFile("/home/amit/testData.csv").cache()
> val result = data.mapPartitions(pLines).groupByKey
> //val list = result.filter(x=> {(x._1).contains("24050881")})
>
>   }
>
> }
>
>
> Here groupByKey is not working . But same thing is working from *spark-shell.*
>
> Please help me
>
>
> Thanks
>
> Amit
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Building Spark behind a proxy

2015-01-30 Thread Arush Kharbanda
Hi Soumya,

I meant: is there any difference in the error message when you configure
JAVA_OPTS and when you don't?

Are you facing the same issue when you build using Maven?

Thanks
Arush

On Thu, Jan 29, 2015 at 10:22 PM, Soumya Simanta 
wrote:

> I can do a
> wget
> http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
> and get the file successfully on a shell.
>
>
>
> On Thu, Jan 29, 2015 at 11:51 AM, Boromir Widas 
> wrote:
>
>> At least a part of it is due to connection refused, can you check if curl
>> can reach the URL with proxies -
>> [FATAL] Non-resolvable parent POM: Could not transfer artifact
>> org.apache:apache:pom:14 from/to central (
>> http://repo.maven.apache.org/maven2): Error transferring file:
>> Connection refused from
>> http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
>>
>> On Thu, Jan 29, 2015 at 11:35 AM, Soumya Simanta <
>> soumya.sima...@gmail.com> wrote:
>>
>>>
>>>
>>> On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda <
>>> ar...@sigmoidanalytics.com> wrote:
>>>
>>>> Does  the error change on build with and without the built options?
>>>>
>>> What do you mean by build options? I'm just doing ./sbt/sbt assembly
>>> from $SPARK_HOME
>>>
>>>
>>>> Did you try using maven? and doing the proxy settings there.
>>>>
>>>
>>>  No I've not tried maven yet. However, I did set proxy settings inside
>>> my .m2/setting.xml, but it didn't make any difference.
>>>
>>>
>>>
>>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Error when running spark in debug mode

2015-01-30 Thread Arush Kharbanda
Hi Ankur,

How are you enabling the debug level of logs? It should be a log4j
configuration. Even if there were an issue, it would be in log4j and
not in Spark.

Thanks
Arush

On Fri, Jan 30, 2015 at 4:24 AM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi,
>
> When ever I enable DEBUG level logs for my spark cluster, on running a job
> all the executors die with the below exception. On disabling the DEBUG logs
> my jobs move to the next step.
>
>
> I am on spark-1.1.0
>
> Is this a known issue with spark?
>
> Thanks
> Ankur
>
> 2015-01-29 22:27:42,467 [main] INFO  org.apache.spark.SecurityManager -
> SecurityManager: authentication disabled; ui acls disabled; users with view
> permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
>
> 2015-01-29 22:27:42,478 [main] DEBUG org.apache.spark.util.AkkaUtils - In
> createActorSystem, requireCookie is: off
>
> 2015-01-29 22:27:42,871
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO
> akka.event.slf4j.Slf4jLogger - Slf4jLogger started
>
> 2015-01-29 22:27:42,912
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
> Starting remoting
>
> 2015-01-29 22:27:43,057
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
> Remoting started; listening on addresses :[akka.tcp://
> driverPropsFetcher@10.77.9.155:36035]
>
> 2015-01-29 22:27:43,060
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
> Remoting now listens on addresses: [akka.tcp://
> driverPropsFetcher@10.77.9.155:36035]
>
> 2015-01-29 22:27:43,067 [main] INFO  org.apache.spark.util.Utils -
> Successfully started service 'driverPropsFetcher' on port 36035.
>
> 2015-01-29 22:28:13,077 [main] ERROR
> org.apache.hadoop.security.UserGroupInformation -
> PriviledgedActionException as:ubuntu
> cause:java.util.concurrent.TimeoutException: Futures timed out after [30
> seconds]
>
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException:
> Unknown exception in doAs
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>
> Caused by: java.security.PrivilegedActionException:
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>
> ... 4 more
>
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [30 seconds]
>
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>
> at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>
> at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>
> at scala.concurrent.Await$.result(package.scala:107)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
>
> ... 7 more
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Building Spark behind a proxy

2015-01-29 Thread Arush Kharbanda
  at sbt.StandardMain$.runManaged(Main.scala:57)
> at sbt.xMain.run(Main.scala:29)
> at xsbt.boot.Launch$$anonfun$run$1.apply(Launch.scala:109)
> at xsbt.boot.Launch$.withContextLoader(Launch.scala:129)
> at xsbt.boot.Launch$.run(Launch.scala:109)
> at xsbt.boot.Launch$$anonfun$apply$1.apply(Launch.scala:36)
> at xsbt.boot.Launch$.launch(Launch.scala:117)
> at xsbt.boot.Launch$.apply(Launch.scala:19)
> at xsbt.boot.Boot$.runImpl(Boot.scala:44)
> at xsbt.boot.Boot$.main(Boot.scala:20)
> at xsbt.boot.Boot.main(Boot.scala)
> [error] org.apache.maven.model.building.ModelBuildingException: 1 problem
> was encountered while building the effective model for
> org.apache.spark:spark-parent:1.1.1
> [error] [FATAL] Non-resolvable parent POM: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (
> http://repo.maven.apache.org/maven2): Error transferring file: Connection
> refused from
> http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
> and 'parent.relativePath' points at wrong local POM @ line 21, column 11
> [error] Use 'last' for the full log.
> Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: is there a master for spark cluster in ec2

2015-01-29 Thread Arush Kharbanda
Hi Mohit,

You can set the master instance type with -m.

To set up a cluster you need to use the ec2/spark-ec2 script.

You need to create an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your
AWS web console under Security Credentials and pass them to the script above.
Once you do that, you should be able to set up your cluster using the
spark-ec2 options.

Thanks
Arush



On Thu, Jan 29, 2015 at 6:41 AM, Mohit Singh  wrote:

> Hi,
>   Probably a naive question.. But I am creating a spark cluster on ec2
> using the ec2 scripts in there..
> But is there a master param I need to set..
> ./bin/pyspark --master [ ] ??
> I don't yet fully understand the ec2 concepts so just wanted to confirm
> this??
> Thanks
>
> --
> Mohit
>
> "When you want success as badly as you want the air, then you will get it.
> There is no other secret of success."
> -Socrates
>



-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: unknown issue in submitting a spark job

2015-01-29 Thread Arush Kharbanda
mpl: Uncaught fatal error from
> thread [Driver-scheduler-1] shutting down ActorSystem [Driver]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at
> akka.dispatch.AbstractNodeQueue.(AbstractNodeQueue.java:19)
> at
>
> akka.actor.LightArrayRevolverScheduler$TaskQueue.(Scheduler.scala:431)
> at
>
> akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:397)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
> at java.lang.Thread.run(Thread.java:745)
> 15/01/29 08:54:33 WARN storage.BlockManagerMasterActor: Removing
> BlockManager BlockManagerId(0, ip-10-10-8-191.us-west-2.compute.internal,
> 47722, 0) with no recent heart beats: 82575ms exceeds 45000ms
> 15/01/29 08:54:33 INFO spark.ContextCleaner: Cleaned RDD 1
> 15/01/29 08:54:33 WARN util.AkkaUtils: Error sending message in 1 attempts
> akka.pattern.AskTimeoutException:
> Recipient[Actor[akka://sparkDriver/user/BlockManagerMaster#-538003375]] had
> already been terminated.
> at
> akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
> at
> org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:175)
> at
>
> org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:218)
> at
>
> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:126)
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/unknown-issue-in-submitting-a-spark-job-tp21418.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Dependency unresolved hadoop-yarn-common 1.0.4 when running quickstart example

2015-01-29 Thread Arush Kharbanda
Hi Sarwar,

For a quick fix you can exclude the YARN dependencies (you won't be needing
them if you are running locally). The general sbt syntax for excluding a
transitive dependency looks like this, and an adapted sketch follows below:

libraryDependencies +=
  "log4j" % "log4j" % "1.2.15" exclude("javax.jms", "jms")


You can also analyze your dependencies using this plugin

https://github.com/jrudolph/sbt-dependency-graph


Thanks
Arush

On Thu, Jan 29, 2015 at 4:20 AM, Sarwar Bhuiyan 
wrote:

> Hello all,
>
> I'm trying to build the sample application on the spark 1.2.0 quickstart
> page (https://spark.apache.org/docs/latest/quick-start.html) using the
> following build.sbt file:
>
> name := "Simple Project"
>
> version := "1.0"
>
> scalaVersion := "2.10.4"
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
>
> Upon calling sbt package, it downloaded a lot of dependencies but
> eventually failed with some warnings and errors. Here's the snippet:
>
> [warn]  ::
> [warn]  ::  UNRESOLVED DEPENDENCIES ::
> [warn]  ::
> [warn]  :: org.apache.hadoop#hadoop-yarn-common;1.0.4: not found
> [warn]  :: org.apache.hadoop#hadoop-yarn-client;1.0.4: not found
> [warn]  :: org.apache.hadoop#hadoop-yarn-api;1.0.4: not found
> [warn]  ::
> [warn]  ::
> [warn]  ::  FAILED DOWNLOADS::
> [warn]  :: ^ see resolution messages for details  ^ ::
> [warn]  ::
> [warn]  ::
> org.eclipse.jetty.orbit#javax.transaction;1.1.1.v201105210645!javax.transaction.orbit
> [warn]  ::
> org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit
> [warn]  ::
> org.eclipse.jetty.orbit#javax.mail.glassfish;1.4.1.v201005082020!javax.mail.glassfish.orbit
> [warn]  ::
> org.eclipse.jetty.orbit#javax.activation;1.1.0.v201105071233!javax.activation.orbit
> [warn]  ::
>
>
>
> Upon checking the maven repositories there doesn't seem to be any
> hadoop-yarn-common 1.0.4. I've tried explicitly setting a dependency to
> hadoop-yarn-common 2.4.0 for example but to no avail. I've also tried
> setting a number of different repositories to see if maybe one of them
> might have that dependency. Still no dice.
>
> What's the best way to resolve this for a quickstart situation? Do I have
> to set some sort of profile or environment variable which doesn't try to
> bring the 1.0.4 yarn version?
>
> Any help would be greatly appreciated.
>
> Sarwar
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Hive on Spark vs. SparkSQL using Hive ?

2015-01-28 Thread Arush Kharbanda
Spark SQL on Hive

1. The purpose of Spark SQL is to allow Spark users to selectively use SQL
expressions (with not a huge number of functions currently supported) when
writing Spark jobs; see the sketch after these lists.
2. Already available.

Hive on Spark

1. Hive users will automatically get the whole set of Hive's rich features,
including any new features that Hive might introduce in the future.
2. Under development.
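
To illustrate the first point about Spark SQL, a minimal sketch of embedding a
SQL expression in an ordinary Spark job through HiveContext (the table name is
a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("sparksql-on-hive"))
val hiveContext = new HiveContext(sc)

// The query result is a SchemaRDD, i.e. an RDD of Rows, so it can be mixed
// freely with regular RDD transformations.
val rows = hiveContext.sql("SELECT key, value FROM src")   // 'src' is a placeholder table
rows.map(row => row.getInt(0)).take(10).foreach(println)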

On Thu, Jan 29, 2015 at 4:54 AM, ogoh  wrote:

>
> Hello,
> probably this question was already asked but still I'd like to confirm from
> Spark users.
>
> This following blog shows 'hive on spark' :
>
> http://blog.cloudera.com/blog/2014/12/hands-on-hive-on-spark-in-the-aws-cloud/
> ".
> How is it different from using hive as data storage of SparkSQL
> (http://spark.apache.org/docs/latest/sql-programming-guide.html)?
> Also, is there any update about SparkSQL's next release (current one is
> still alpha)?
>
> Thanks,
> OGoh
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Hive-on-Spark-vs-SparkSQL-using-Hive-tp21412.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Spark serialization issues with third-party libraries

2014-11-24 Thread Arush Kharbanda
Hi,

You can see my code here; it's a POC to implement UIMA on Spark:

https://bitbucket.org/SigmoidDev/uimaspark

https://bitbucket.org/SigmoidDev/uimaspark/src/8476fdf16d84d0f517cce45a8bc1bd3410927464/UIMASpark/src/main/scala/UIMAProcessor.scala?at=master

This is the class where the major part of the integration happens.
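
Related to the serialization workaround mentioned in the quoted thread below, a
common pattern is to keep the non-serializable engine out of the task closure by
marking it @transient and re-creating it lazily on the executors. A minimal,
generic sketch (the Analyzer class and its analyze method are hypothetical
stand-ins, not code from UIMASpark):

import org.apache.spark.rdd.RDD

// Hypothetical non-serializable third-party engine (stand-in for a UIMA component).
class Analyzer {
  def analyze(text: String): String = text.toUpperCase
}

// Serializable wrapper: the engine field is @transient, so it is not shipped
// with the task closure; the lazy val re-creates it inside each executor JVM.
class AnalyzerWrapper extends Serializable {
  @transient lazy val engine: Analyzer = new Analyzer
}

def run(input: RDD[String]): RDD[String] = {
  val wrapper = new AnalyzerWrapper
  input.mapPartitions { lines =>
    lines.map(wrapper.engine.analyze)
  }
}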

Thanks
Arush

On Sun, Nov 23, 2014 at 7:52 PM, jatinpreet  wrote:

> Thanks Sean, I was actually using instances created elsewhere inside my RDD
> transformations which as I understand is against Spark programming model. I
> was referred to a talk about UIMA and Spark integration from this year's
> Spark summit, which had a workaround for this problem. I just had to make
> some class members transient.
>
> http://spark-summit.org/2014/talk/leveraging-uima-in-spark
>
> Thanks
>
>
>
> -
> Novice Big Data Programmer
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-serialization-issues-with-third-party-libraries-tp19454p19589.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

[image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com