Unsubscribe

2022-07-28 Thread Ashish


Unsubscribe
Sent from my iPhone




Problem of how to retrieve file from HDFS

2019-10-08 Thread Ashish Mittal
Hi,
I am trying to store and retrieve a CSV file from HDFS. I have successfully
stored the CSV file in HDFS and trained a LinearRegressionModel on it in Spark
using Java, but I have not been able to retrieve the CSV file from HDFS. How
do I retrieve the CSV file from HDFS?
Code:
SparkSession sparkSession = SparkSession.builder()
        .appName("JavaSparkModelWithHadoopHDFSExample")
        .master("local[2]")
        .getOrCreate();
SQLContext sqlContext = new SQLContext(sparkSession);

VectorAssembler assembler = new VectorAssembler();
assembler.setInputCols(new String[] { "MONTH_1", "MONTH_2", "MONTH_3",
        "MONTH_4", "MONTH_5", "MONTH_6" })
        .setOutputCol("features");

Dataset<Row> rowDataSet = sqlContext.read().format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("hdfs://localhost:9000/user/hadoop/inpit/data/history.csv");
rowDataSet.show();
rowDataSet.printSchema();

Dataset<Row> vectorDataSet = assembler.transform(rowDataSet).drop("CUST_ID");
vectorDataSet.show();

LinearRegression lr = new LinearRegression()
        .setMaxIter(10)
        .setRegParam(0.3)
        .setElasticNetParam(0.8)
        .setFeaturesCol("features")
        .setLabelCol("CLV");
lr.setPredictionCol("prediction");

LinearRegressionModel lrModel = lr.fit(vectorDataSet);

lrModel.write().overwrite().save("hdfs://localhost:9000/user/hadoop/inpit/data/history.csv");

This code successfully stores the CSV file, but I don't know how to retrieve
the CSV file from HDFS. Please help me.
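
For reference, a minimal sketch (shown in Scala) of reading the CSV back and
reloading the saved model, assuming the data and the model are written to
separate HDFS paths - the model path below is hypothetical:

import org.apache.spark.ml.regression.LinearRegressionModel

// Read the CSV file back from HDFS into a DataFrame
val history = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs://localhost:9000/user/hadoop/inpit/data/history.csv")
history.show()

// Reload a model that was saved with lrModel.write().overwrite().save(...)
val reloaded = LinearRegressionModel.load("hdfs://localhost:9000/user/hadoop/models/lrModel")

Note that saving the model to the same path as the CSV (as the snippet above
does) replaces the data with the model's files, so the two are best kept at
separate paths.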

Thanks & Regards,
Ashish Mittal


Re: Spark Streaming to REST API

2017-12-21 Thread ashish rawat
Sorry for not making it explicit. We are using Spark Streaming as the
streaming solution, and I was wondering whether it is a common pattern to do
per-tuple Redis reads/writes and write to a REST API through Spark Streaming.

Regards,
Ashish

On Fri, Dec 22, 2017 at 4:00 AM, Gourav Sengupta 
wrote:

> hi Ashish,
>
> I was just wondering if there is any particular reason why you are posting
> this to a SPARK group?
>
> Regards,
> Gourav
>
> On Thu, Dec 21, 2017 at 8:32 PM, ashish rawat  wrote:
>
>> Hi,
>>
>> We are working on a streaming solution where multiple out of order
>> streams are flowing in the system and we need to join the streams based on
>> a unique id. We are planning to use redis for this, where for every tuple,
>> we will lookup if the id exists, we join if it does or else put the tuple
>> into redis. Also, we need to write the final out to a system through REST
>> API (the system doesn't provide any other mechanism to write).
>>
>> Is it a common pattern to read/write to db per tuple? Also, are there any
>> connectors to write to REST endpoints.
>>
>> Regards,
>> Ashish
>>
>
>


Spark Streaming to REST API

2017-12-21 Thread ashish rawat
Hi,

We are working on a streaming solution where multiple out-of-order streams
flow into the system and we need to join the streams on a unique id. We are
planning to use Redis for this: for every tuple, we look up whether the id
exists; we join if it does, or else put the tuple into Redis. We also need to
write the final output to a system through a REST API (the system doesn't
provide any other write mechanism).

Is it a common pattern to read/write to a DB per tuple? Also, are there any
connectors for writing to REST endpoints?
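
Not aware of an official REST connector, but a minimal sketch of how this
pattern is often structured - batching the Redis and REST round-trips per
partition rather than per tuple - assuming a Jedis client, a DStream of
(id, payload) pairs named events, and a hypothetical endpoint URL:

import java.net.{HttpURLConnection, URL}
import redis.clients.jedis.Jedis

events.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val jedis = new Jedis("redis-host", 6379)      // one connection per partition
    partition.foreach { case (id, payload) =>
      Option(jedis.get(id)) match {
        case Some(other) =>
          // both halves present: join them and POST the result
          val joined = s"$other|$payload"
          val conn = new URL("http://example.com/api/events")
            .openConnection().asInstanceOf[HttpURLConnection]
          conn.setRequestMethod("POST")
          conn.setDoOutput(true)
          val out = conn.getOutputStream
          out.write(joined.getBytes("UTF-8"))
          out.close()
          conn.getResponseCode                     // force the call; handle errors here
          conn.disconnect()
        case None =>
          jedis.set(id, payload)                   // first half seen: park it in Redis
      }
    }
    jedis.close()
  }
}

At any real volume a connection pool and batched or async HTTP calls are worth
adding, but the per-partition structure above is the main point.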

Regards,
Ashish


Re: NLTK with Spark Streaming

2017-12-01 Thread ashish rawat
Thanks Nicholas, but the problem for us is that we want to use the NLTK
Python library, since our data scientists are training with it. Rewriting the
inference logic using some other library would be time consuming and, in some
cases, may not even work because some functions are unavailable.

On Nov 29, 2017 3:16 AM, "Nicholas Hakobian" <
nicholas.hakob...@rallyhealth.com> wrote:

Depending on your needs, its fairly easy to write a lightweight python
wrapper around the Databricks spark-corenlp library: https://github.com/
databricks/spark-corenlp


Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Sun, Nov 26, 2017 at 8:19 AM, ashish rawat  wrote:

> Thanks Holden and Chetan.
>
> Holden - Have you tried it out, do you know the right way to do it?
> Chetan - yes, if we use a Java NLP library, it should not be any issue in
> integrating with spark streaming, but as I pointed out earlier, we want to
> give flexibility to data scientists to use the language and library of
> their choice, instead of restricting them to a library of our choice.
>
> On Sun, Nov 26, 2017 at 9:42 PM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> But you can still use Stanford NLP library and distribute through spark
>> right !
>>
>> On Sun, Nov 26, 2017 at 3:31 PM, Holden Karau 
>> wrote:
>>
>>> So it’s certainly doable (it’s not super easy mind you), but until the
>>> arrow udf release goes out it will be rather slow.
>>>
>>> On Sun, Nov 26, 2017 at 8:01 AM ashish rawat 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Has someone tried running NLTK (python) with Spark Streaming (scala)? I
>>>> was wondering if this is a good idea and what are the right Spark operators
>>>> to do this? The reason we want to try this combination is that we don't
>>>> want to run our transformations in python (pyspark), but after the
>>>> transformations, we need to run some natural language processing operations
>>>> and we don't want to restrict the functions data scientists' can use to
>>>> Spark natural language library. So, Spark streaming with NLTK looks like
>>>> the right option, from the perspective of fast data processing and data
>>>> science flexibility.
>>>>
>>>> Regards,
>>>> Ashish
>>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>


Re: NLTK with Spark Streaming

2017-11-26 Thread ashish rawat
Thanks Holden and Chetan.

Holden - have you tried it out? Do you know the right way to do it?
Chetan - yes, if we use a Java NLP library, there should not be any issue
integrating it with Spark Streaming, but as I pointed out earlier, we want to
give data scientists the flexibility to use the language and library of their
choice, instead of restricting them to a library of our choice.

On Sun, Nov 26, 2017 at 9:42 PM, Chetan Khatri 
wrote:

> But you can still use Stanford NLP library and distribute through spark
> right !
>
> On Sun, Nov 26, 2017 at 3:31 PM, Holden Karau 
> wrote:
>
>> So it’s certainly doable (it’s not super easy mind you), but until the
>> arrow udf release goes out it will be rather slow.
>>
>> On Sun, Nov 26, 2017 at 8:01 AM ashish rawat  wrote:
>>
>>> Hi,
>>>
>>> Has someone tried running NLTK (python) with Spark Streaming (scala)? I
>>> was wondering if this is a good idea and what are the right Spark operators
>>> to do this? The reason we want to try this combination is that we don't
>>> want to run our transformations in python (pyspark), but after the
>>> transformations, we need to run some natural language processing operations
>>> and we don't want to restrict the functions data scientists' can use to
>>> Spark natural language library. So, Spark streaming with NLTK looks like
>>> the right option, from the perspective of fast data processing and data
>>> science flexibility.
>>>
>>> Regards,
>>> Ashish
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


NLTK with Spark Streaming

2017-11-25 Thread ashish rawat
Hi,

Has anyone tried running NLTK (Python) with Spark Streaming (Scala)? I was
wondering whether this is a good idea and what the right Spark operators are
to do this. The reason we want to try this combination is that we don't want
to run our transformations in Python (PySpark), but after the transformations
we need to run some natural language processing operations, and we don't want
to restrict the functions data scientists can use to Spark's natural language
libraries. So Spark Streaming with NLTK looks like the right option, from the
perspective of both fast data processing and data science flexibility.
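
One approach that keeps the transformations in Scala is RDD.pipe: stream each
micro-batch through an external Python process that imports NLTK. A minimal
sketch, assuming a DStream[String] named lines and a hypothetical
nltk_score.py script that reads records from stdin and writes one result per
line to stdout (the script has to be shipped to the executors, e.g. via
--files, and NLTK installed on every node):

// Scala side: pipe each partition of each batch through the Python/NLTK script
val scored = lines.transform { rdd =>
  rdd.pipe("python nltk_score.py")
}
scored.print()

Compared with the Arrow-based Python UDFs Holden mentions, pipe() pays
per-record stdin/stdout serialization overhead, so it trades performance for
simplicity.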

Regards,
Ashish


Re: Spark based Data Warehouse

2017-11-17 Thread ashish rawat
Thanks everyone for the suggestions. Do any of you handle automatic scale-up
and scale-down of your underlying Spark clusters on AWS?

On Nov 14, 2017 10:46 AM, "lucas.g...@gmail.com" 
wrote:

Hi Ashish, bear in mind that EMR has some additional tooling available that
smoothes out some S3 problems that you may / almost certainly will
encounter.

We are using Spark / S3 not on EMR and have encountered issues with file
consistency, you can deal with it but be aware it's additional technical
debt that you'll need to own.  We didn't want to own an HDFS cluster so we
consider it worthwhile.

Here are some additional resources:  The video is Steve Loughran talking
about S3.
https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
https://www.youtube.com/watch?v=ND4L_zSDqF0

For the record we use S3 heavily but tend to drop our processed data into
databases so they can be more easily consumed by visualization tools.

Good luck!

Gary Lucas

On 13 November 2017 at 20:04, Affan Syed  wrote:

> Another option that we are trying internally is to uses Mesos for
> isolating different jobs or groups. Within a single group, using Livy to
> create different spark contexts also works.
>
> - Affan
>
> On Tue, Nov 14, 2017 at 8:43 AM, ashish rawat  wrote:
>
>> Thanks Sky Yin. This really helps.
>>
>> On Nov 14, 2017 12:11 AM, "Sky Yin"  wrote:
>>
>> We are running Spark in AWS EMR as data warehouse. All data are in S3 and
>> metadata in Hive metastore.
>>
>> We have internal tools to creat juypter notebook on the dev cluster. I
>> guess you can use zeppelin instead, or Livy?
>>
>> We run genie as a job server for the prod cluster, so users have to
>> submit their queries through the genie. For better resource utilization, we
>> rely on Yarn dynamic allocation to balance the load of multiple
>> jobs/queries in Spark.
>>
>> Hope this helps.
>>
>> On Sat, Nov 11, 2017 at 11:21 PM ashish rawat 
>> wrote:
>>
>>> Hello Everyone,
>>>
>>> I was trying to understand if anyone here has tried a data warehouse
>>> solution using S3 and Spark SQL. Out of multiple possible options
>>> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
>>> our aggregates and processing requirements.
>>>
>>> If anyone has tried it out, would like to understand the following:
>>>
>>>1. Is Spark SQL and UDF, able to handle all the workloads?
>>>2. What user interface did you provide for data scientist, data
>>>engineers and analysts
>>>3. What are the challenges in running concurrent queries, by many
>>>    users, over Spark SQL? Considering Spark still does not provide spill to
>>>disk, in many scenarios, are there frequent query failures when executing
>>>concurrent queries
>>>4. Are there any open source implementations, which provide
>>>something similar?
>>>
>>>
>>> Regards,
>>> Ashish
>>>
>>
>>
>


Re: Spark based Data Warehouse

2017-11-13 Thread ashish rawat
Thanks Sky Yin. This really helps.

On Nov 14, 2017 12:11 AM, "Sky Yin"  wrote:

We are running Spark in AWS EMR as data warehouse. All data are in S3 and
metadata in Hive metastore.

We have internal tools to create Jupyter notebooks on the dev cluster. I
guess you can use Zeppelin instead, or Livy?

We run genie as a job server for the prod cluster, so users have to submit
their queries through the genie. For better resource utilization, we rely
on Yarn dynamic allocation to balance the load of multiple jobs/queries in
Spark.

Hope this helps.

On Sat, Nov 11, 2017 at 11:21 PM ashish rawat  wrote:

> Hello Everyone,
>
> I was trying to understand if anyone here has tried a data warehouse
> solution using S3 and Spark SQL. Out of multiple possible options
> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
> our aggregates and processing requirements.
>
> If anyone has tried it out, would like to understand the following:
>
>1. Is Spark SQL and UDF, able to handle all the workloads?
>2. What user interface did you provide for data scientist, data
>engineers and analysts
>3. What are the challenges in running concurrent queries, by many
>users, over Spark SQL? Considering Spark still does not provide spill to
>disk, in many scenarios, are there frequent query failures when executing
>concurrent queries
>4. Are there any open source implementations, which provide something
>similar?
>
>
> Regards,
> Ashish
>


Re: Spark based Data Warehouse

2017-11-13 Thread ashish rawat
Thanks everyone. I am still not clear on the right way to support multiple
users running concurrent queries with Spark. Is it through multiple Spark
contexts, or through Livy (which creates only a single Spark context)?

Also, what kind of isolation is possible with Spark SQL? If one user fires
a big query, would that choke all other queries in the cluster?

Regards,
Ashish

On Mon, Nov 13, 2017 at 3:10 AM, Patrick Alwell 
wrote:

> Alcon,
>
>
>
> You can most certainly do this. I’ve done benchmarking with Spark SQL and
> the TPCDS queries using S3 as the filesystem.
>
>
>
> Zeppelin and Livy server work well for the dash boarding and concurrent
> query issues:  https://hortonworks.com/blog/livy-a-rest-interface-for-
> apache-spark/
>
>
>
> Livy Server will allow you to create multiple spark contexts via REST:
> https://livy.incubator.apache.org/
>
>
>
> If you are looking for broad SQL functionality I’d recommend instantiating
> a Hive context. And Spark is able to spill to disk ->
> https://spark.apache.org/faq.html
>
>
>
> There are multiple companies running spark within their data warehouse
> solutions: https://ibmdatawarehousing.wordpress.com/2016/10/12/
> steinbach_dashdb_local_spark/
>
>
>
> Edmunds used Spark to allow business analysts to point Spark to files in
> S3 and infer schema: https://www.youtube.com/watch?v=gsR1ljgZLq0
>
>
>
> Recommend running some benchmarks and testing query scenarios for your end
> users; but it sounds like you’ll be using it for exploratory analysis.
> Spark is great for this ☺
>
>
>
> -Pat
>
>
>
>
>
> *From: *Vadim Semenov 
> *Date: *Sunday, November 12, 2017 at 1:06 PM
> *To: *Gourav Sengupta 
> *Cc: *Phillip Henry , ashish rawat <
> dceash...@gmail.com>, Jörn Franke , Deepak Sharma <
> deepakmc...@gmail.com>, spark users 
> *Subject: *Re: Spark based Data Warehouse
>
>
>
> It's actually quite simple to answer
>
>
>
> > 1. Is Spark SQL and UDF, able to handle all the workloads?
>
> Yes
>
>
>
> > 2. What user interface did you provide for data scientist, data
> engineers and analysts
>
> Home-grown platform, EMR, Zeppelin
>
>
>
> > What are the challenges in running concurrent queries, by many users,
> over Spark SQL? Considering Spark still does not provide spill to disk, in
> many scenarios, are there frequent query failures when executing concurrent
> queries
>
> You can run separate Spark Contexts, so jobs will be isolated
>
>
>
> > Are there any open source implementations, which provide something
> similar?
>
> Yes, many.
>
>
>
>
>
> On Sun, Nov 12, 2017 at 1:47 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
> Dear Ashish,
>
> what you are asking for involves at least a few weeks of dedicated
> understanding of your used case and then it takes at least 3 to 4 months to
> even propose a solution. You can even build a fantastic data warehouse just
> using C++. The matter depends on lots of conditions. I just think that your
> approach and question needs a lot of modification.
>
>
>
> Regards,
>
> Gourav
>
>
>
> On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry 
> wrote:
>
> Hi, Ashish.
>
> You are correct in saying that not *all* functionality of Spark is
> spill-to-disk but I am not sure how this pertains to a "concurrent user
> scenario". Each executor will run in its own JVM and is therefore isolated
> from others. That is, if the JVM of one user dies, this should not effect
> another user who is running their own jobs in their own JVMs. The amount of
> resources used by a user can be controlled by the resource manager.
>
> AFAIK, you configure something like YARN to limit the number of cores and
> the amount of memory in the cluster a certain user or group is allowed to
> use for their job. This is obviously quite a coarse-grained approach as (to
> my knowledge) IO is not throttled. I believe people generally use something
> like Apache Ambari to keep an eye on network and disk usage to mitigate
> problems in a shared cluster.
>
> If the user has badly designed their query, it may very well fail with
> OOMEs but this can happen irrespective of whether one user or many is using
> the cluster at a given moment in time.
>
>
>
> Does this help?
>
> Regards,
>
> Phillip
>
>
>
> On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat  wrote:
>
> Thanks Jorn and Phillip. My question was specifically to anyone who have
> tried creating a system using spark SQL, as Data Warehouse. I was trying to
> check, if someone has tried it and they can help with the kind of workloads
> which worked and the ones, which have problems.

Re: Spark based Data Warehouse

2017-11-12 Thread ashish rawat
Thanks Jorn and Phillip. My question was specifically for anyone who has
tried creating a system using Spark SQL as a data warehouse. I was trying to
check if someone has tried it and can help with the kinds of workloads that
worked and the ones that had problems.

Regarding spill to disk, I might be wrong, but not all of Spark's
functionality spills to disk, so it still doesn't provide DB-like reliability
in execution. With databases, queries get slow under load, but they don't
fail or run out of memory, specifically in concurrent-user scenarios.

Regards,
Ashish

On Nov 12, 2017 3:02 PM, "Phillip Henry"  wrote:

Agree with Jorn. The answer is: it depends.

In the past, I've worked with data scientists who are happy to use the
Spark CLI. Again, the answer is "it depends" (in this case, on the skills
of your customers).

Regarding sharing resources, different teams were limited to their own
queue so they could not hog all the resources. However, people within a
team had to do some horse trading if they had a particularly intensive job
to run. I did feel that this was an area that could be improved. It may be
by now, I've just not looked into it for a while.

BTW I'm not sure what you mean by "Spark still does not provide spill to
disk" as the FAQ says "Spark's operators spill data to disk if it does not
fit in memory" (http://spark.apache.org/faq.html). So, your data will not
normally cause OutOfMemoryErrors (certain terms and conditions may apply).

My 2 cents.

Phillip



On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke  wrote:

> What do you mean all possible workloads?
> You cannot prepare any system to do all possible processing.
>
> We do not know the requirements of your data scientists now or in the
> future so it is difficult to say. How do they work currently without the
> new solution? Do they all work on the same data? I bet you will receive on
> your email a lot of private messages trying to sell their solution that
> solves everything - with the information you provided this is impossible to
> say.
>
> Then with every system: have incremental releases but have then in short
> time frames - do not engineer a big system that you will deliver in 2
> years. In the cloud you have the perfect possibility to scale feature but
> also infrastructure wise.
>
> Challenges with concurrent queries is the right definition of the
> scheduler (eg fairscheduler) that not one query take all the resources or
> that long running queries starve.
>
> User interfaces: what could help are notebooks (Jupyter etc) but you may
> need to train your data scientists. Some may know or prefer other tools.
>
> On 12. Nov 2017, at 08:32, Deepak Sharma  wrote:
>
> I am looking for similar solution more aligned to data scientist group.
> The concern i have is about supporting complex aggregations at runtime .
>
> Thanks
> Deepak
>
> On Nov 12, 2017 12:51, "ashish rawat"  wrote:
>
>> Hello Everyone,
>>
>> I was trying to understand if anyone here has tried a data warehouse
>> solution using S3 and Spark SQL. Out of multiple possible options
>> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
>> our aggregates and processing requirements.
>>
>> If anyone has tried it out, would like to understand the following:
>>
>>1. Is Spark SQL and UDF, able to handle all the workloads?
>>2. What user interface did you provide for data scientist, data
>>engineers and analysts
>>3. What are the challenges in running concurrent queries, by many
>>users, over Spark SQL? Considering Spark still does not provide spill to
>>disk, in many scenarios, are there frequent query failures when executing
>>concurrent queries
>>4. Are there any open source implementations, which provide something
>>similar?
>>
>>
>> Regards,
>> Ashish
>>
>


Spark based Data Warehouse

2017-11-11 Thread ashish rawat
Hello Everyone,

I was trying to understand if anyone here has tried a data warehouse
solution using S3 and Spark SQL. Out of multiple possible options
(redshift, presto, hive etc), we were planning to go with Spark SQL, for
our aggregates and processing requirements.

If anyone has tried it out, I would like to understand the following:

   1. Are Spark SQL and UDFs able to handle all the workloads?
   2. What user interface did you provide for data scientists, data
   engineers and analysts?
   3. What are the challenges in running concurrent queries, by many users,
   over Spark SQL? Considering Spark still does not provide spill to disk in
   many scenarios, are there frequent query failures when executing concurrent
   queries?
   4. Are there any open source implementations that provide something
   similar?


Regards,
Ashish


Re: Azure Event Hub with Pyspark

2017-04-20 Thread Ashish Singh
Hi ,

You can try https://github.com/hdinsight/spark-eventhubs, which is an Event
Hubs receiver for Spark Streaming. We are using it, but I think it is
available for Scala only.


Thanks,
Ashish Singh

On Fri, Apr 21, 2017 at 9:19 AM, ayan guha  wrote:

>
> Hi
>
> I am not able to find any conector to be used to connect spark streaming
> with Azure Event Hub, using pyspark.
>
> Does anyone know if there is such library/package exists>?
>
> --
> Best Regards,
> Ayan Guha
>
>


Document listing spark sql aggregate functions

2016-10-03 Thread Ashish Tadose
Hi Team,

Is there a documentation page that lists all the aggregate functions
supported in the Spark SQL query language?

The same way the DataFrame aggregate functions are listed here:
https://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.functions$

I was looking for the Spark SQL query aggregate function to replace a
DataFrame operation phase that uses approxCountDistinct().

I could find it in the source, as below, but not in the docs:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala

Let me know if it's there in the docs and I missed it.
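
If the docs don't list them, one way to see what the running version actually
registers is to ask the SQL layer itself - a small sketch, assuming Spark
1.6+, where SHOW FUNCTIONS and DESCRIBE FUNCTION are available; the SQL-level
name approx_count_distinct is an assumption to verify against your version:

// List every function registered in the SQL FunctionRegistry
sqlContext.sql("SHOW FUNCTIONS").show(500, false)

// Inspect the assumed SQL counterpart of the DataFrame approxCountDistinct()
sqlContext.sql("DESCRIBE FUNCTION EXTENDED approx_count_distinct").show(false)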

Thanks,
Ashish


Spark 2.0 issue

2016-09-29 Thread Ashish Shrowty
If I try to inner-join two DataFrames that originated from the same initial
DataFrame, which was loaded using a spark.sql() call, it results in an error:

    // reading from Hive .. the data is stored in Parquet format in Amazon S3
    val d1 = spark.sql("select * from ")
    val df1 = d1.groupBy("key1","key2").agg(avg("totalprice").as("avgtotalprice"))
    val df2 = d1.groupBy("key1","key2").agg(avg("itemcount").as("avgqty"))

df1.join(df2, Seq("key1","key2")) gives the error:

    org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can
    not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];

If the same DataFrame is initialized via spark.read.parquet(), the above
code works. The same code also worked with Spark 1.6.2. I created a
JIRA too: SPARK-17709 <https://issues.apache.org/jira/browse/SPARK-17709>

Any help appreciated!

Thanks,
Ashish






Spark can't connect to secure phoenix

2016-09-16 Thread Ashish Gupta
Hi All,

I am running a Spark program on a secured cluster which creates a SQLContext
for creating a DataFrame over a Phoenix table.

When I run my program in local mode with the --master option set to local[2],
it works completely fine; however, when I try to run the same program with
the master option set to yarn-client, I get the exception below:

Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed 
after attempts=5, exceptions:
Fri Sep 16 12:14:10 IST 2016, RpcRetryingCaller{globalStartTime=1474008247898, 
pause=100, retries=5}, org.apache.hadoop.hbase.MasterNotRunningException: 
com.google.protobuf.ServiceException: java.io.IOException: Could not set up IO 
Streams to demo-qa2-nn/10.60.2.15:16000
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:147)
at 
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:4083)
at 
org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:528)
at 
org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:550)
at 
org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureTableCreated(ConnectionQueryServicesImpl.java:810)
... 50 more
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: 
com.google.protobuf.ServiceException: java.io.IOException: Could not set up IO 
Streams to demo-qa2-nn/10.60.2.15:16000
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1540)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionManager.java:1560)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1711)
at 
org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:38)
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
... 54 more
Caused by: com.google.protobuf.ServiceException: java.io.IOException: Could 
not set up IO Streams to demo-qa2-nn/10.60.2.15:16000
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
at 
org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:58152)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.isMasterRunning(ConnectionManager.java:1571)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStubNoRetries(ConnectionManager.java:1509)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1531)
... 58 more
Caused by: java.io.IOException: Could not set up IO Streams to 
demo-qa2-nn/10.60.2.15:16000
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:779)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:887)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:856)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1200)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
... 63 more
Caused by: java.lang.RuntimeException: SASL authentication failed. The most 
likely cause is missing or invalid credentials. Consider 'kinit'.
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$1.run(RpcClientImpl.java:679)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.handleSaslConnectionFailure(RpcClientImpl.java:637)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:745)
... 67 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused 
by GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at 
org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientIm

Returning DataFrame as Scala method return type

2016-09-08 Thread Ashish Tadose
Hi Team,

I have a Spark job with a large number of DataFrame operations.

The job reads various lookup data from external tables such as MySQL, and
also runs a lot of DataFrame operations on large Parquet data on HDFS.

The job works fine on the cluster; however, the driver code looks clumsy
because of the large number of operations written in the driver method.

I would like to organize these DataFrame operations by grouping them into
Scala object methods, something like the following:



object Driver {
  def main(args: Array[String]): Unit = {
    val df = Operations.process(sparkContext)
  }
}

object Operations {
  def process(sparkContext: SparkContext): DataFrame = {
    // series of dataframe operations
  }
}


My question is: is retrieving a DataFrame as the return type of another Scala
object's method the right thing to do at large scale? Would returning the
DataFrame to the driver cause all the data to be passed to the driver code,
or is just a reference to the DataFrame returned?


Thanks,
Ashish


Logstash to collect Spark logs

2016-05-20 Thread Ashish Kumar Singh
We are trying to collect Spark logs using Logstash, for parsing app logs
and collecting useful info.

We can read the NodeManager logs, but we are unable to read the Spark
application logs using Logstash.

Current Setup for Spark logs and Logstash
1-  Spark runs on Yarn .
2-  Using log4j socketAppenders to write logs to tcp port .
3- Below lines added in log4j.properties of Yarn and Spark conf:

main.logger=RFA,SA
 log4j.appender.SA=org.apache.log4j.net.SocketAppender
log4j.appender.SA.Port=4560
log4j.appender.SA.RemoteHost=${hostname}
log4j.appender.SA.ReconnectionDelay=1
log4j.appender.SA.Application=NM-${user.dir}

4-Logstash input
  input {
  log4j {
mode => "server"
host => "0.0.0.0"
port => 4560
type => "log4j"
  }
}


Any help on reading Spark logs via Logstash will be appreciated  .
Also, is there a better way to collect Spark logs via Logstash ?


Spark log collection via Logstash

2016-05-19 Thread Ashish Kumar Singh
We are trying to collect Spark logs using Logstash, for parsing app logs
and collecting useful info.

We can read the NodeManager logs, but we are unable to read the Spark
application logs using Logstash.

Current Setup for Spark logs and Logstash
1-  Spark runs on Yarn .
2-  Using log4j socketAppenders to write logs to tcp port .
3- Below lines added in log4j.properties of Yarn and Spark
main.logger=RFA,SA

log4j.appender.SA=org.apache.log4j.net.SocketAppender
log4j.appender.SA.Port=4560
log4j.appender.SA.RemoteHost=${hostname}
log4j.appender.SA.ReconnectionDelay=1
log4j.appender.SA.Application=NM-${user.dir}

4-Logstash input
  input {
  log4j {
mode => "server"
host => "0.0.0.0"
port => 4560
type => "log4j"
  }
}


Any help on reading Spark logs via Logstash will be appreciated  .
Also, is there a better way to collect Spark logs via Logstash ?

Thanks,
Ashish


Re: Joining a RDD to a Dataframe

2016-05-08 Thread Ashish Dubey
Is there any reason you don't want to convert it? I don't think a join
between an RDD and a DataFrame is supported.
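
If converting the RDD to a DataFrame is acceptable, one way around the
array_contains limitation (it takes a literal, not a Column, in this version)
is to explode the array and do an equi-join. A minimal sketch against the
schemas quoted below, assuming the RDD has already been converted to df_input:

import org.apache.spark.sql.functions._

// Explode addresses so each element's id becomes a joinable value, then
// equi-join; rows are duplicated per address, so de-dup afterwards if needed.
val exploded = df.withColumn("addr", explode(col("addresses")))
val joined = exploded.join(df_input,
  exploded("addr").getField("id") === df_input("id"))
joined.select(df("id")).show()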

On Sat, May 7, 2016 at 11:41 PM, Cyril Scetbon 
wrote:

> Hi,
>
> I have a RDD built during a spark streaming job and I'd like to join it to
> a DataFrame (E/S input) to enrich it.
> It seems that I can't join the RDD and the DF without converting first the
> RDD to a DF (Tell me if I'm wrong). Here are the schemas of both DF :
>
> scala> df
> res32: org.apache.spark.sql.DataFrame = [f1: string, addresses:
> array>, id: string]
>
> scala> df_input
> res33: org.apache.spark.sql.DataFrame = [id: string]
>
> scala> df_input.collect
> res34: Array[org.apache.spark.sql.Row] = Array([idaddress2], [idaddress12])
>
> I can get ids I want if I know the value to look for in addresses.id
> using :
>
> scala> df.filter(array_contains(df("addresses.id"),
> "idaddress2")).select("id").collect
> res35: Array[org.apache.spark.sql.Row] = Array([], [YY])
>
> However when I try to join df_input and df and to use the previous filter
> as the join condition I get an exception :
>
> scala> df.join(df_input, array_contains(df("adresses.id"),
> df_input("id")))
> java.lang.RuntimeException: Unsupported literal type class
> org.apache.spark.sql.Column id
> at
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:50)
> at
> org.apache.spark.sql.functions$.array_contains(functions.scala:2452)
> ...
>
> It seems that array_contains only supports static arguments and does not
> replace a sql.Column by its value.
>
> What's the best way to achieve what I want to do ? (Also speaking in term
> of performance)
>
> Thanks
>
>


Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-08 Thread Ashish Dubey
I see the behavior - it always goes with the minimum total number of tasks
possible for your settings (num-executors * num-cores); however, if you use a
huge amount of data you will see more tasks, which suggests some kind of
lower bound on the number of tasks. It may require some digging; other
formats did not seem to have this issue.

On Sun, May 8, 2016 at 12:10 AM, Johnny W.  wrote:

> The file size is very small (< 1M). The stage launches every time i call:
> --
> sqlContext.read.parquet(path_to_file)
>
> These are the parquet specific configurations I set:
> --
> spark.sql.parquet.filterPushdown: true
> spark.sql.parquet.mergeSchema: true
>
> Thanks,
> J.
>
> On Sat, May 7, 2016 at 4:20 PM, Ashish Dubey  wrote:
>
>> How big is your file and can you also share the code snippet
>>
>>
>> On Saturday, May 7, 2016, Johnny W.  wrote:
>>
>>> hi spark-user,
>>>
>>> I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
>>> dataframe from a parquet data source with a single parquet file, it yields
>>> a stage with lots of small tasks. It seems the number of tasks depends on
>>> how many executors I have instead of how many parquet files/partitions I
>>> have. Actually, it launches 5 tasks on each executor.
>>>
>>> This behavior is quite strange, and may have potential issue if there is
>>> a slow executor. What is this "parquet" stage for? and why it launches 5
>>> tasks on each executor?
>>>
>>> Thanks,
>>> J.
>>>
>>
>


Re: BlockManager crashing applications

2016-05-08 Thread Ashish Dubey
Caused by: java.io.IOException: Failed to connect to
ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)


The above message indicates that there used to be an executor at that
address, and by the time the other executor tried to read from it, it no
longer existed. You may be able to confirm this by looking at the Spark
application UI - you may find dead executors there.

On Sun, May 8, 2016 at 6:02 PM, Brandon White 
wrote:

> I'm not quite sure how this is a memory problem. There are no OOM
> exceptions and the job only breaks when actions are ran in parallel,
> submitted to the scheduler by different threads.
>
> The issue is that the doGetRemote function does not retry when it is
> denied access to a cache block.
> On May 8, 2016 5:55 PM, "Ashish Dubey"  wrote:
>
> Brandon,
>
> how much memory are you giving to your executors - did you check if there
> were dead executors in your application logs.. Most likely you require
> higher memory for executors..
>
> Ashish
>
> On Sun, May 8, 2016 at 1:01 PM, Brandon White 
> wrote:
>
>> Hello all,
>>
>> I am running a Spark application which schedules multiple Spark jobs.
>> Something like:
>>
>> val df  = sqlContext.read.parquet("/path/to/file")
>>
>> filterExpressions.par.foreach { expression =>
>>   df.filter(expression).count()
>> }
>>
>> When the block manager fails to fetch a block, it throws an exception
>> which eventually kills the exception: http://pastebin.com/2ggwv68P
>>
>> This code works when I run it on one thread with:
>>
>> filterExpressions.foreach { expression =>
>>   df.filter(expression).count()
>> }
>>
>> But I really need the parallel execution of the jobs. Is there anyway
>> around this? It seems like a bug in the BlockManagers doGetRemote function.
>> I have tried the HTTP Block Manager as well.
>>
>
>


Re: BlockManager crashing applications

2016-05-08 Thread Ashish Dubey
Brandon,

How much memory are you giving your executors? Did you check whether there
were dead executors in your application logs? Most likely you need more
memory for the executors.

Ashish

On Sun, May 8, 2016 at 1:01 PM, Brandon White 
wrote:

> Hello all,
>
> I am running a Spark application which schedules multiple Spark jobs.
> Something like:
>
> val df  = sqlContext.read.parquet("/path/to/file")
>
> filterExpressions.par.foreach { expression =>
>   df.filter(expression).count()
> }
>
> When the block manager fails to fetch a block, it throws an exception
> which eventually kills the exception: http://pastebin.com/2ggwv68P
>
> This code works when I run it on one thread with:
>
> filterExpressions.foreach { expression =>
>   df.filter(expression).count()
> }
>
> But I really need the parallel execution of the jobs. Is there anyway
> around this? It seems like a bug in the BlockManagers doGetRemote function.
> I have tried the HTTP Block Manager as well.
>


Re: Parse Json in Spark

2016-05-08 Thread Ashish Dubey
This limit is due to the underlying InputFormat implementation. You can
always write your own InputFormat and then use Spark's newAPIHadoopFile API,
passing your InputFormat class. You will have to place the jar file in the
/lib location on all the nodes.

Ashish

On Sun, May 8, 2016 at 4:02 PM, Hyukjin Kwon  wrote:

>
> I remember this Jira, https://issues.apache.org/jira/browse/SPARK-7366.
> Parsing multiple lines is not supported in the JSON data source.
>
> Instead this can be done by sc.wholeTextFiles(). I found some examples
> here,
> http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
>
> Although this reads a file as a whole record, this should work.
>
> Thanks!
> On 9 May 2016 7:20 a.m., "KhajaAsmath Mohammed" 
> wrote:
>
>> Hi,
>>
>> I am working on parsing the json in spark but most of the information
>> available online states that  I need to have entire JSON in single line.
>>
>> In my case, Json file is delivered in complex structure and not in a
>> single line. could anyone know how to process this in SPARK.
>>
>> I used Jackson jar to process json and was able to do it when it is
>> present in single line. Any ideas?
>>
>> Thanks,
>> Asmath
>>
>
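
A minimal sketch of the wholeTextFiles() approach mentioned above - each file
comes back as a single (path, content) record, so a JSON document that spans
multiple lines stays intact (the path below is a placeholder, and each file
is assumed to hold one JSON document small enough to fit in executor memory):

// Read whole files so multi-line JSON is not split line-by-line
val wholeFiles = sc.wholeTextFiles("hdfs:///path/to/json/dir")   // RDD[(path, content)]

// Feed the file contents to the JSON reader and let Spark infer the schema
val df = sqlContext.read.json(wholeFiles.values)
df.printSchema()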


Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Ashish Dubey
The driver maintains the complete metadata of the application (scheduling of
executors and the messaging that controls execution), and this code seems to
be failing in that code path. That said, there is JVM overhead on the driver
based on the number of executors, stages, and tasks in your app. Do you know
your driver heap size and application structure (number of stages and tasks)?

Ashish
On Saturday, May 7, 2016, Nirav Patel  wrote:

> Right but this logs from spark driver and spark driver seems to use Akka.
>
> ERROR [sparkDriver-akka.actor.default-dispatcher-17]
> akka.actor.ActorSystemImpl: Uncaught fatal error from thread
> [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down
> ActorSystem [sparkDriver]
>
> I saw following logs before above happened.
>
> 2016-05-06 09:49:17,813 INFO
> [sparkDriver-akka.actor.default-dispatcher-17]
> org.apache.spark.MapOutputTrackerMasterEndpoint: Asked to send map output
> locations for shuffle 1 to hdn6.xactlycorporation.local:44503
>
>
> As far as I know driver is just driving shuffle operation but not actually
> doing anything within its own system that will cause memory issue. Can you
> explain in what circumstances I could see this error in driver logs? I
> don't do any collect or any other driver operation that would cause this.
> It fails when doing aggregateByKey operation but that should happen in
> executor JVM NOT in driver JVM.
>
>
> Thanks
>
> On Sat, May 7, 2016 at 11:58 AM, Ted Yu  > wrote:
>
>> bq.   at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
>>
>> It was Akka which uses JavaSerializer
>>
>> Cheers
>>
>> On Sat, May 7, 2016 at 11:13 AM, Nirav Patel > > wrote:
>>
>>> Hi,
>>>
>>> I thought I was using kryo serializer for shuffle.  I could verify it
>>> from spark UI - Environment tab that
>>> spark.serializer org.apache.spark.serializer.KryoSerializer
>>> spark.kryo.registrator
>>> com.myapp.spark.jobs.conf.SparkSerializerRegistrator
>>>
>>>
>>> But when I see following error in Driver logs it looks like spark is
>>> using JavaSerializer
>>>
>>> 2016-05-06 09:49:26,490 ERROR
>>> [sparkDriver-akka.actor.default-dispatcher-17] akka.actor.ActorSystemImpl:
>>> Uncaught fatal error from thread
>>> [sparkDriver-akka.remote.default-remote-dispatcher-6] shutting down
>>> ActorSystem [sparkDriver]
>>>
>>> java.lang.OutOfMemoryError: Java heap space
>>>
>>> at java.util.Arrays.copyOf(Arrays.java:2271)
>>>
>>> at
>>> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>>
>>> at
>>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>>
>>> at
>>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>>>
>>> at
>>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
>>>
>>> at
>>> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
>>>
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
>>>
>>> at
>>> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>>>
>>> at
>>> akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
>>>
>>> at
>>> akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
>>>
>>> at
>>> akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
>>>
>>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>>>
>>> at
>>> akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
>>>
>>> at
>>> akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
>>>
>>> at
>>> akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
>>>
>>> at
>>> akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
>>>
>>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>>>
>>> at
>>> akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:842)
>>>
>>> at akka.remote.EndpointWriter.writeSend(Endpoint.scala:743)
>>>
>>> at
>>>

Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-07 Thread Ashish Dubey
How big is your file, and can you also share the code snippet?

On Saturday, May 7, 2016, Johnny W.  wrote:

> hi spark-user,
>
> I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
> dataframe from a parquet data source with a single parquet file, it yields
> a stage with lots of small tasks. It seems the number of tasks depends on
> how many executors I have instead of how many parquet files/partitions I
> have. Actually, it launches 5 tasks on each executor.
>
> This behavior is quite strange, and may have potential issue if there is a
> slow executor. What is this "parquet" stage for? and why it launches 5
> tasks on each executor?
>
> Thanks,
> J.
>


Re: Spark for Log Analytics

2016-03-31 Thread ashish rawat
Thanks for your replies Steve and Chris.

Steve,

I am creating a real-time pipeline, so I am not looking to dump data to HDFS
right now. Also, since the log sources would be Nginx, Mongo and application
events, it might not be possible to always route events directly from the
source to Flume. Therefore, I thought the "tail -f" strategy used by fluentd,
Logstash and others might be the only unifying solution for collecting logs.

Chris,

Can you please elaborate on the source-to-Kafka part? Do all event sources
have integration with Kafka? For example, if you need to send the server logs
(Apache/Nginx/Mongo etc.) to Kafka, what would be the ideal strategy?

Regards,
Ashish

On Thu, Mar 31, 2016 at 5:16 PM, Chris Fregly  wrote:

> oh, and I forgot to mention Kafka Streams which has been heavily talked
> about the last few days at Strata here in San Jose.
>
> Streams can simplify a lot of this architecture by perform some
> light-to-medium-complex transformations in Kafka itself.
>
> i'm waiting anxiously for Kafka 0.10 with production-ready Kafka Streams,
> so I can try this out myself - and hopefully remove a lot of extra plumbing.
>
> On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly  wrote:
>
>> this is a very common pattern, yes.
>>
>> note that in Netflix's case, they're currently pushing all of their logs
>> to a Fronting Kafka + Samza Router which can route to S3 (or HDFS),
>> ElasticSearch, and/or another Kafka Topic for further consumption by
>> internal apps using other technologies like Spark Streaming (instead of
>> Samza).
>>
>> this Fronting Kafka + Samza Router also helps to differentiate between
>> high-priority events (Errors or High Latencies) and normal-priority events
>> (normal User Play or Stop events).
>>
>> here's a recent presentation i did which details this configuration
>> starting at slide 104:
>> http://www.slideshare.net/cfregly/dc-spark-users-group-march-15-2016-spark-and-netflix-recommendations
>> .
>>
>> btw, Confluent's distribution of Kafka does have a direct Http/REST API
>> which is not recommended for production use, but has worked well for me in
>> the past.
>>
>> these are some additional options to think about, anyway.
>>
>>
>> On Thu, Mar 31, 2016 at 4:26 AM, Steve Loughran 
>> wrote:
>>
>>>
>>> On 31 Mar 2016, at 09:37, ashish rawat  wrote:
>>>
>>> Hi,
>>>
>>> I have been evaluating Spark for analysing Application and Server Logs.
>>> I believe there are some downsides of doing this:
>>>
>>> 1. No direct mechanism of collecting log, so need to introduce other
>>> tools like Flume into the pipeline.
>>>
>>>
>>> you need something to collect logs no matter what you run. Flume isn't
>>> so bad; if you bring it up on the same host as the app then you can even
>>> collect logs while the network is playing up.
>>>
>>> Or you can just copy log4j files to HDFS and process them later
>>>
>>> 2. Need to write lots of code for parsing different patterns from logs,
>>> while some of the log analysis tools like logstash or loggly provide it out
>>> of the box
>>>
>>>
>>>
>>> Log parsing is essentially an ETL problem, especially if you don't try
>>> to lock down the log event format.
>>>
>>> You can also configure Log4J to save stuff in an easy-to-parse format
>>> and/or forward directly to your application.
>>>
>>> There's a log4j to flume connector to do that for you,
>>>
>>>
>>> http://www.thecloudavenue.com/2013/11/using-log4jflume-to-log-application.html
>>>
>>> or you can output in, say, JSON (
>>> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/log/Log4Json.java
>>>  )
>>>
>>> I'd go with flume unless you had a need to save the logs locally and
>>> copy them to HDFS laster.
>>>
>>>
>>>
>>> On the benefits side, I believe Spark might be more performant (although
>>> I am yet to benchmark it) and being a generic processing engine, might work
>>> with complex use cases where the out of the box functionality of log
>>> analysis tools is not sufficient (although I don't have any such use case
>>> right now).
>>>
>>> One option I was considering was to use logstash for collection and
>>> basic processing and then sink the processed logs to both elastic search
>>> and kafka. So that Spark Streaming can pick data from Kafka for the complex
>>> use cases, while logstash filters can be used for the simpler use cases.
>>>
>>> I was wondering if someone has already done this evaluation and could
>>> provide me some pointers on how/if to create this pipeline with Spark.
>>>
>>> Regards,
>>> Ashish
>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Chris Fregly*
>> Principal Data Solutions Engineer
>> IBM Spark Technology Center, San Francisco, CA
>> http://spark.tc | http://advancedspark.com
>>
>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>


Spark for Log Analytics

2016-03-31 Thread ashish rawat
Hi,

I have been evaluating Spark for analysing application and server logs. I
believe there are some downsides to doing this:

1. There is no direct mechanism for collecting logs, so we need to introduce
other tools like Flume into the pipeline.
2. We need to write lots of code for parsing different patterns from logs,
while some of the log analysis tools like Logstash or Loggly provide this out
of the box.

On the benefits side, I believe Spark might be more performant (although I am
yet to benchmark it) and, being a generic processing engine, might work with
complex use cases where the out-of-the-box functionality of log analysis
tools is not sufficient (although I don't have any such use case right now).

One option I was considering was to use Logstash for collection and basic
processing and then sink the processed logs to both Elasticsearch and Kafka,
so that Spark Streaming can pick up data from Kafka for the complex use
cases, while Logstash filters can be used for the simpler use cases.

I was wondering if someone has already done this evaluation and could provide
me some pointers on how/if to create this pipeline with Spark.

Regards,
Ashish


Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
When you say the driver is running on Mesos, can you explain how you are doing that?

> On Mar 10, 2016, at 4:44 PM, Eran Chinthaka Withana 
>  wrote:
> 
> Yanling I'm already running the driver on mesos (through docker). FYI, I'm 
> running this on cluster mode with MesosClusterDispatcher.
> 
> Mac (client) > MesosClusterDispatcher > Driver running on Mesos --> 
> Workers running on Mesos
> 
> My next step is to run MesosClusterDispatcher in mesos through marathon. 
> 
> Thanks,
> Eran Chinthaka Withana
> 
>> On Thu, Mar 10, 2016 at 11:11 AM, yanlin wang  wrote:
>> How you guys make driver docker within container to be reachable from spark 
>> worker ? 
>> 
>> Would you share your driver docker? i am trying to put only driver in docker 
>> and spark running with yarn outside of container and i don’t want to use 
>> —net=host 
>> 
>> Thx
>> Yanlin
>> 
>>> On Mar 10, 2016, at 11:06 AM, Guillaume Eynard Bontemps 
>>>  wrote:
>>> 
>>> Glad to hear it. Thanks all  for sharing your  solutions.
>>> 
>>> 
>>> Le jeu. 10 mars 2016 19:19, Eran Chinthaka Withana 
>>>  a écrit :
 Phew, it worked. All I had to do was to add export 
 SPARK_JAVA_OPTS="-Dspark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6"
  before calling spark-submit. Guillaume, thanks for the pointer. 
 
 Timothy, thanks for looking into this. Looking forward to see a fix soon. 
 
 Thanks,
 Eran Chinthaka Withana
 
> On Thu, Mar 10, 2016 at 10:10 AM, Tim Chen  wrote:
> Hi Eran,
> 
> I need to investigate but perhaps that's true, we're using 
> SPARK_JAVA_OPTS to pass all the options and not --conf.
> 
> I'll take a look at the bug, but if you can try the workaround and see if 
> that fixes your problem.
> 
> Tim
> 
>> On Thu, Mar 10, 2016 at 10:08 AM, Eran Chinthaka Withana 
>>  wrote:
>> Hi Timothy
>> 
>>> What version of spark are you guys running?
>> 
>> I'm using Spark 1.6.0. You can see the Dockerfile I used here: 
>> https://github.com/echinthaka/spark-mesos-docker/blob/master/docker/mesos-spark/Dockerfile
>>  
>>  
>>> And also did you set the working dir in your image to be spark home?
>> 
>> Yes I did. You can see it here: https://goo.gl/8PxtV8
>> 
>> Can it be because of this: 
>> https://issues.apache.org/jira/browse/SPARK-13258 as Guillaume pointed 
>> out above? As you can see, I'm passing in the docker image URI through 
>> spark-submit (--conf 
>> spark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6)
>> 
>> Thanks,
>> Eran
> 


Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
Hi Tim,

Can you please share your Dockerfiles and configuration, as it will help a
lot. I am planning to publish a blog post on the same.

Ashish

On Thu, Mar 10, 2016 at 10:34 AM, Timothy Chen  wrote:

> No you don't need to install spark on each slave, we have been running
> this setup in Mesosphere without any problem at this point, I think most
> likely configuration problem and perhaps a chance something is missing in
> the code to handle some cases.
>
> What version of spark are you guys running? And also did you set the
> working dir in your image to be spark home?
>
> Tim
>
>
> On Mar 10, 2016, at 3:11 AM, Ashish Soni  wrote:
>
> You need to install spark on each mesos slave and then while starting
> container make a workdir to your spark home so that it can find the spark
> class.
>
> Ashish
>
> On Mar 10, 2016, at 5:22 AM, Guillaume Eynard Bontemps <
> g.eynard.bonte...@gmail.com> wrote:
>
> For an answer to my question see this:
> http://stackoverflow.com/a/35660466?noredirect=1.
>
> But for your problem did you define  the  Spark.mesos.docker. home or
> something like  that property?
>
> Le jeu. 10 mars 2016 04:26, Eran Chinthaka Withana <
> eran.chinth...@gmail.com> a écrit :
>
>> Hi
>>
>> I'm also having this issue and can not get the tasks to work inside mesos.
>>
>> In my case, the spark-submit command is the following.
>>
>> $SPARK_HOME/bin/spark-submit \
>>  --class com.mycompany.SparkStarter \
>>  --master mesos://mesos-dispatcher:7077 \ --name SparkStarterJob \
>> --driver-memory 1G \
>>  --executor-memory 4G \
>> --deploy-mode cluster \
>>  --total-executor-cores 1 \
>>  --conf 
>> spark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6 \
>>  http://abc.com/spark-starter.jar
>>
>>
>> And the error I'm getting is the following.
>>
>> I0310 03:13:11.417009 131594 exec.cpp:132] Version: 0.23.1
>> I0310 03:13:11.419452 131601 exec.cpp:206] Executor registered on slave 
>> 20160223-000314-3439362570-5050-631-S0
>> sh: 1: /usr/spark-1.6.0-bin-hadoop2.6/bin/spark-class: not found
>>
>>
>> (Looked into Spark JIRA and I found that
>> https://issues.apache.org/jira/browse/SPARK-11759 is marked as closed
>> since https://issues.apache.org/jira/browse/SPARK-12345 is marked as
>> resolved)
>>
>> Really appreciate if I can get some help here.
>>
>> Thanks,
>> Eran Chinthaka Withana
>>
>> On Wed, Feb 17, 2016 at 2:00 PM, g.eynard.bonte...@gmail.com <
>> g.eynard.bonte...@gmail.com> wrote:
>>
>>> Hi everybody,
>>>
>>> I am testing the use of Docker for executing Spark algorithms on MESOS. I
>>> managed to execute Spark in client mode with executors inside Docker,
>>> but I
>>> wanted to go further and have also my Driver running into a Docker
>>> Container. Here I ran into a behavior that I'm not sure is normal, let me
>>> try to explain.
>>>
>>> I submit my spark application through MesosClusterDispatcher using a
>>> command
>>> like:
>>> $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>>> mesos://spark-master-1:7077 --deploy-mode cluster --conf
>>> spark.mesos.executor.docker.image=myuser/myimage:0.0.2
>>>
>>> https://storage.googleapis.com/some-bucket/spark-examples-1.5.2-hadoop2.6.0.jar
>>> 10
>>>
>>> My driver is running fine, inside its docker container, but the executors
>>> fail:
>>> "sh: /some/spark/home/bin/spark-class: No such file or directory"
>>>
>>> Looking on MESOS slaves log, I think that the executors do not run inside
>>> docker: "docker.cpp:775] No container info found, skipping launch". As my
>>> Mesos slaves do not have spark installed, it fails.
>>>
>>> *It seems that the spark conf that I gave in the first spark-submit is
>>> not
>>> transmitted to the Driver submitted conf*, when launched in the docker
>>> container. The only workaround I found is to modify my Docker image in
>>> order
>>> to define inside its spark conf the spark.mesos.executor.docker.image
>>> property. This way, my executors get the conf well and are launched
>>> inside
>>> docker on Mesos. This seems a little complicated to me, and I feel the
>>> configuration passed to the early spark-submit should be transmitted to
>>> the
>>> Driver submit...
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-mixing-MESOS-Cluster-Mode-and-Docker-task-execution-tp26258.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>>> <http://nabble.com>.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>


Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
You need to install Spark on each Mesos slave, and then while starting the
container set the working directory to your Spark home so that it can find the
spark-class script.

Ashish

> On Mar 10, 2016, at 5:22 AM, Guillaume Eynard Bontemps 
>  wrote:
> 
> For an answer to my question see this: 
> http://stackoverflow.com/a/35660466?noredirect=1.
> 
> But for your problem did you define  the  Spark.mesos.docker. home or 
> something like  that property?
> 
> 
> Le jeu. 10 mars 2016 04:26, Eran Chinthaka Withana  
> a écrit :
>> Hi
>> 
>> I'm also having this issue and can not get the tasks to work inside mesos.
>> 
>> In my case, the spark-submit command is the following.
>> 
>> $SPARK_HOME/bin/spark-submit \
>>  --class com.mycompany.SparkStarter \
>>  --master mesos://mesos-dispatcher:7077 \
>>  --name SparkStarterJob \
>> --driver-memory 1G \
>>  --executor-memory 4G \
>> --deploy-mode cluster \
>>  --total-executor-cores 1 \
>>  --conf 
>> spark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6 \
>>  http://abc.com/spark-starter.jar
>> 
>> And the error I'm getting is the following.
>> 
>> I0310 03:13:11.417009 131594 exec.cpp:132] Version: 0.23.1
>> I0310 03:13:11.419452 131601 exec.cpp:206] Executor registered on slave 
>> 20160223-000314-3439362570-5050-631-S0
>> sh: 1: /usr/spark-1.6.0-bin-hadoop2.6/bin/spark-class: not found
>> 
>> (Looked into Spark JIRA and I found that 
>> https://issues.apache.org/jira/browse/SPARK-11759 is marked as closed since 
>> https://issues.apache.org/jira/browse/SPARK-12345 is marked as resolved)
>> 
>> Really appreciate if I can get some help here.
>> 
>> Thanks,
>> Eran Chinthaka Withana
>> 
>>> On Wed, Feb 17, 2016 at 2:00 PM, g.eynard.bonte...@gmail.com 
>>>  wrote:
>>> Hi everybody,
>>> 
>>> I am testing the use of Docker for executing Spark algorithms on MESOS. I
>>> managed to execute Spark in client mode with executors inside Docker, but I
>>> wanted to go further and have also my Driver running into a Docker
>>> Container. Here I ran into a behavior that I'm not sure is normal, let me
>>> try to explain.
>>> 
>>> I submit my spark application through MesosClusterDispatcher using a command
>>> like:
>>> $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>>> mesos://spark-master-1:7077 --deploy-mode cluster --conf
>>> spark.mesos.executor.docker.image=myuser/myimage:0.0.2
>>> https://storage.googleapis.com/some-bucket/spark-examples-1.5.2-hadoop2.6.0.jar
>>> 10
>>> 
>>> My driver is running fine, inside its docker container, but the executors
>>> fail:
>>> "sh: /some/spark/home/bin/spark-class: No such file or directory"
>>> 
>>> Looking on MESOS slaves log, I think that the executors do not run inside
>>> docker: "docker.cpp:775] No container info found, skipping launch". As my
>>> Mesos slaves do not have spark installed, it fails.
>>> 
>>> *It seems that the spark conf that I gave in the first spark-submit is not
>>> transmitted to the Driver submitted conf*, when launched in the docker
>>> container. The only workaround I found is to modify my Docker image in order
>>> to define inside its spark conf the spark.mesos.executor.docker.image
>>> property. This way, my executors get the conf well and are launched inside
>>> docker on Mesos. This seems a little complicated to me, and I feel the
>>> configuration passed to the early spark-submit should be transmitted to the
>>> Driver submit...
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-mixing-MESOS-Cluster-Mode-and-Docker-task-execution-tp26258.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org


Looking for Collaborator - Boston ( Spark Training )

2016-03-05 Thread Ashish Soni
Hi All,

I am developing a detailed, highly technical course on Spark (beyond word
count) and am looking for a partner. Let me know if anyone is interested.

Ashish 
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.5 on Mesos

2016-03-04 Thread Ashish Soni
It did not help; same error. Is this the issue I am running into:
https://issues.apache.org/jira/browse/SPARK-11638

*Warning: Local jar /mnt/mesos/sandbox/spark-examples-1.6.0-hadoop2.6.0.jar
does not exist, skipping.*
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi

On Thu, Mar 3, 2016 at 4:12 PM, Tim Chen  wrote:

> Ah I see, I think it's because you've launched the Mesos slave in a docker
> container, and when you launch also the executor in a container it's not
> able to mount in the sandbox to the other container since the slave is in a
> chroot.
>
> Can you try mounting in a volume from the host when you launch the slave
> for your slave's workdir?
> docker run -v /tmp/mesos/slave:/tmp/mesos/slave mesos_image mesos-slave
> --work_dir=/tmp/mesos/slave ....
>
> Tim
>
> On Thu, Mar 3, 2016 at 4:42 AM, Ashish Soni  wrote:
>
>> Hi Tim ,
>>
>>
>> I think I know the problem but i do not have a solution , *The Mesos
>> Slave supposed to download the Jars from the URI specified and placed in
>> $MESOS_SANDBOX location but it is not downloading not sure why* .. see
>> below logs
>>
>> My command looks like below
>>
>> docker run -it --rm -m 2g -e SPARK_MASTER="mesos://10.0.2.15:7077"  -e
>> SPARK_IMAGE="spark_driver:latest" spark_driver:latest ./bin/spark-submit
>> --deploy-mode cluster --class org.apache.spark.examples.SparkPi
>> http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
>>
>> [root@Mindstorm spark-1.6.0]# docker logs d22d8e897b79
>> *Warning: Local jar
>> /mnt/mesos/sandbox/spark-examples-1.6.0-hadoop2.6.0.jar does not exist,
>> skipping.*
>> java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:278)
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> When i do docker inspect i see below command gets issued
>>
>> "Cmd": [
>> "-c",
>> "./bin/spark-submit --name org.apache.spark.examples.SparkPi
>> --master mesos://10.0.2.15:5050 --driver-cores 1.0 --driver-memory 1024M
>> --class org.apache.spark.examples.SparkPi 
>> $*MESOS_SANDBOX*/spark-examples-1.6.0-hadoop2.6.0.jar
>> "
>>
>>
>>
>> On Thu, Mar 3, 2016 at 12:09 AM, Tim Chen  wrote:
>>
>>> You shouldn't need to specify --jars at all since you only have one jar.
>>>
>>> The error is pretty odd as it suggests it's trying to load
>>> /opt/spark/Example but that doesn't really seem to be anywhere in your
>>> image or command.
>>>
>>> Can you paste your stdout from the driver task launched by the cluster
>>> dispatcher, that shows you the spark-submit command it eventually ran?
>>>
>>>
>>> Tim
>>>
>>>
>>>
>>> On Wed, Mar 2, 2016 at 5:42 PM, Ashish Soni 
>>> wrote:
>>>
>>>> See below  and Attached the Dockerfile to build the spark image  (
>>>> between i just upgraded to 1.6 )
>>>>
>>>> I am running below setup -
>>>>
>>>> Mesos Master - Docker Container
>>>> Mesos Slave 1 - Docker Container
>>>> Mesos Slave 2 - Docker Container
>>>> Marathon - Docker Container
>>>> Spark MESOS Dispatcher - Docker Container
>>>>
>>>> when i submit the Spark PI Example Job using below command
>>>>
>>>> *docker run -it --rm -m 2g -e SPARK_MASTER="mesos://10.0.2.15:7077
>>>> <http://10.0.2.15:7077>"  -e SPARK_IMAGE="spark_driver:**latest"
>>>> spark_driver:latest ./bin/spark-submit  --deploy-mode cluster --name "PI
>>

Re: Spark 1.5 on Mesos

2016-03-03 Thread Ashish Soni
Hi Tim ,


I think I know the problem but I do not have a solution. *The Mesos slave is
supposed to download the jars from the URI specified and place them in the
$MESOS_SANDBOX location, but it is not downloading them; not sure why.* See the
logs below.

My command looks like below

docker run -it --rm -m 2g -e SPARK_MASTER="mesos://10.0.2.15:7077"  -e
SPARK_IMAGE="spark_driver:latest" spark_driver:latest ./bin/spark-submit
--deploy-mode cluster --class org.apache.spark.examples.SparkPi
http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar

[root@Mindstorm spark-1.6.0]# docker logs d22d8e897b79
*Warning: Local jar /mnt/mesos/sandbox/spark-examples-1.6.0-hadoop2.6.0.jar
does not exist, skipping.*
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

When i do docker inspect i see below command gets issued

"Cmd": [
"-c",
"./bin/spark-submit --name org.apache.spark.examples.SparkPi
--master mesos://10.0.2.15:5050 --driver-cores 1.0 --driver-memory 1024M
--class org.apache.spark.examples.SparkPi
$*MESOS_SANDBOX*/spark-examples-1.6.0-hadoop2.6.0.jar
"



On Thu, Mar 3, 2016 at 12:09 AM, Tim Chen  wrote:

> You shouldn't need to specify --jars at all since you only have one jar.
>
> The error is pretty odd as it suggests it's trying to load
> /opt/spark/Example but that doesn't really seem to be anywhere in your
> image or command.
>
> Can you paste your stdout from the driver task launched by the cluster
> dispatcher, that shows you the spark-submit command it eventually ran?
>
>
> Tim
>
>
>
> On Wed, Mar 2, 2016 at 5:42 PM, Ashish Soni  wrote:
>
>> See below  and Attached the Dockerfile to build the spark image  (
>> between i just upgraded to 1.6 )
>>
>> I am running below setup -
>>
>> Mesos Master - Docker Container
>> Mesos Slave 1 - Docker Container
>> Mesos Slave 2 - Docker Container
>> Marathon - Docker Container
>> Spark MESOS Dispatcher - Docker Container
>>
>> when i submit the Spark PI Example Job using below command
>>
>> *docker run -it --rm -m 2g -e SPARK_MASTER="mesos://10.0.2.15:7077
>> <http://10.0.2.15:7077>"  -e SPARK_IMAGE="spark_driver:**latest"
>> spark_driver:latest ./bin/spark-submit  --deploy-mode cluster --name "PI
>> Example" --class org.apache.spark.examples.**SparkPi
>> http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
>> <http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar> --jars
>> /opt/spark/lib/spark-examples-**1.6.0-hadoop2.6.0.jar --verbose*
>>
>> Below is the ERROR
>> Error: Cannot load main class from JAR file:/opt/spark/Example
>> Run with --help for usage help or --verbose for debug output
>>
>>
>> When i docker Inspect for the stopped / dead container i see below output
>> what is interesting to see is some one or executor replaced by original
>> command with below in highlighted and i do not see Executor is downloading
>> the JAR -- IS this a BUG i am hitting or not sure if that is supposed to
>> work this way and i am missing some configuration
>>
>> "Env": [
>> "SPARK_IMAGE=spark_driver:latest",
>> "SPARK_SCALA_VERSION=2.10",
>> "SPARK_VERSION=1.6.0",
>> "SPARK_EXECUTOR_URI=
>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz";,
>> "MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos-0.25.0.so",
>> "SPARK_MASTER=mesos://10.0.2.15:7077",
>>
>> "SPARK_EXECUTOR_OPTS=-Dspark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/
>> libmesos-0.25.0.so -Dspark.jars=
>> http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
>> -Dspark.mesos.mesosExecutor.cores=0.1 -Dspark.driver.supervise=f

Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
See below; I have attached the Dockerfile to build the Spark image (by the way,
I just upgraded to 1.6).

I am running below setup -

Mesos Master - Docker Container
Mesos Slave 1 - Docker Container
Mesos Slave 2 - Docker Container
Marathon - Docker Container
Spark MESOS Dispatcher - Docker Container

When I submit the Spark Pi example job using the command below:

*docker run -it --rm -m 2g -e SPARK_MASTER="mesos://10.0.2.15:7077
<http://10.0.2.15:7077>"  -e SPARK_IMAGE="spark_driver:**latest"
spark_driver:latest ./bin/spark-submit  --deploy-mode cluster --name "PI
Example" --class org.apache.spark.examples.**SparkPi
http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
<http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar> --jars
/opt/spark/lib/spark-examples-**1.6.0-hadoop2.6.0.jar --verbose*

Below is the ERROR
Error: Cannot load main class from JAR file:/opt/spark/Example
Run with --help for usage help or --verbose for debug output


When I run docker inspect on the stopped/dead container I see the output below.
What is interesting is that the original command has been replaced with the one
highlighted below, and I do not see the executor downloading the JAR. Is this a
bug I am hitting, or is it supposed to work this way and I am missing some
configuration?

"Env": [
"SPARK_IMAGE=spark_driver:latest",
"SPARK_SCALA_VERSION=2.10",
"SPARK_VERSION=1.6.0",
"SPARK_EXECUTOR_URI=
http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz";,
"MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos-0.25.0.so",
"SPARK_MASTER=mesos://10.0.2.15:7077",

"SPARK_EXECUTOR_OPTS=-Dspark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/
libmesos-0.25.0.so -Dspark.jars=
http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
-Dspark.mesos.mesosExecutor.cores=0.1 -Dspark.driver.supervise=false -
Dspark.app.name=PI Example -Dspark.mesos.uris=
http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
-Dspark.mesos.executor.docker.image=spark_driver:latest
-Dspark.submit.deployMode=cluster -Dspark.master=mesos://10.0.2.15:7077
-Dspark.driver.extraClassPath=/opt/spark/custom/lib/*
-Dspark.executor.extraClassPath=/opt/spark/custom/lib/*
-Dspark.executor.uri=
http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
-Dspark.mesos.executor.home=/opt/spark",
"MESOS_SANDBOX=/mnt/mesos/sandbox",

"MESOS_CONTAINER_NAME=mesos-e47f8d4c-5ee1-4d01-ad07-0d9a03ced62d-S1.43c08f82-e508-4d57-8c0b-fa05bee77fd6",

"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HADOOP_VERSION=2.6",
"SPARK_HOME=/opt/spark"
],
"Cmd": [
"-c",
   * "./bin/spark-submit --name PI Example --master
mesos://10.0.2.15:5050 <http://10.0.2.15:5050> --driver-cores 1.0
--driver-memory 1024M --class org.apache.spark.examples.SparkPi
$MESOS_SANDBOX/spark-examples-1.6.0-hadoop2.6.0.jar --jars
/opt/spark/lib/spark-examples-1.6.0-hadoop2.6.0.jar --verbose"*
],
"Image": "spark_driver:latest",












On Wed, Mar 2, 2016 at 5:49 PM, Charles Allen  wrote:

> @Tim yes, this is asking about 1.5 though
>
> On Wed, Mar 2, 2016 at 2:35 PM Tim Chen  wrote:
>
>> Hi Charles,
>>
>> I thought that's fixed with your patch in latest master now right?
>>
>> Ashish, yes please give me your docker image name (if it's in the public
>> registry) and what you've tried and I can see what's wrong. I think it's
>> most likely just the configuration of where the Spark home folder is in the
>> image.
>>
>> Tim
>>
>> On Wed, Mar 2, 2016 at 2:28 PM, Charles Allen <
>> charles.al...@metamarkets.com> wrote:
>>
>>> Re: Spark on Mesos Warning regarding disk space:
>>> https://issues.apache.org/jira/browse/SPARK-12330
>>>
>>> That's a spark flaw I encountered on a very regular basis on mesos. That
>>> and a few other annoyances are fixed in
>>> https://github.com/metamx/spark/tree/v1.5.2-mmx
>>>
>>> Here's another mild annoyance I've encountered:
>>> https://issues.apache.org/jira/browse/SPARK-11714
>>>
>>> On Wed, Mar 2, 2016 at 1:31 PM Ashish Soni 
>>> wrote:
>>>
>>>> I have no luck and i would to ask the question to spark committers will
>>>> this be ever designed to run on mesos ?
>>>>
>>>> spark app as a docker container not working at all on mesos  ,if any
>>>> one would like the code i can send it over to have a look.
>>>>
>>>> Ashish
>>>

Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
I have had no luck, and I would like to ask the Spark committers: will
this ever be designed to run on Mesos?

A Spark app as a Docker container is not working at all on Mesos. If anyone
would like the code, I can send it over to have a look.

Ashish

On Wed, Mar 2, 2016 at 12:23 PM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Try passing jar using --jars option
>
> On Wed, Mar 2, 2016 at 10:17 AM Ashish Soni  wrote:
>
>> I made some progress but now i am stuck at this point , Please help as
>> looks like i am close to get it working
>>
>> I have everything running in docker container including mesos slave and
>> master
>>
>> When i try to submit the pi example i get below error
>> *Error: Cannot load main class from JAR file:/opt/spark/Example*
>>
>> Below is the command i use to submit as a docker container
>>
>> docker run -it --rm -e SPARK_MASTER="mesos://10.0.2.15:7077"  -e
>> SPARK_IMAGE="spark_driver:latest" spark_driver:latest ./bin/spark-submit
>> --deploy-mode cluster --name "PI Example" --class
>> org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory
>> 512m --executor-cores 1
>> http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
>>
>>
>> On Tue, Mar 1, 2016 at 2:59 PM, Timothy Chen  wrote:
>>
>>> Can you go through the Mesos UI and look at the driver/executor log from
>>> steer file and see what the problem is?
>>>
>>> Tim
>>>
>>> On Mar 1, 2016, at 8:05 AM, Ashish Soni  wrote:
>>>
>>> Not sure what is the issue but i am getting below error  when i try to
>>> run spark PI example
>>>
>>> Blacklisting Mesos slave value: "5345asdasdasdkas234234asdasdasdasd"
>>>due to too many failures; is Spark installed on it?
>>> WARN TaskSchedulerImpl: Initial job has not accepted any resources; 
>>> check your cluster UI to ensure that workers are registered and have 
>>> sufficient resources
>>>
>>>
>>> On Mon, Feb 29, 2016 at 1:39 PM, Sathish Kumaran Vairavelu <
>>> vsathishkuma...@gmail.com> wrote:
>>>
>>>> May be the Mesos executor couldn't find spark image or the constraints
>>>> are not satisfied. Check your Mesos UI if you see Spark application in the
>>>> Frameworks tab
>>>>
>>>> On Mon, Feb 29, 2016 at 12:23 PM Ashish Soni 
>>>> wrote:
>>>>
>>>>> What is the Best practice , I have everything running as docker
>>>>> container in single host ( mesos and marathon also as docker container )
>>>>>  and everything comes up fine but when i try to launch the spark shell i
>>>>> get below error
>>>>>
>>>>>
>>>>> SQL context available as sqlContext.
>>>>>
>>>>> scala> val data = sc.parallelize(1 to 100)
>>>>> data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
>>>>> parallelize at :27
>>>>>
>>>>> scala> data.count
>>>>> [Stage 0:>  (0
>>>>> + 0) / 2]16/02/29 18:21:12 WARN TaskSchedulerImpl: Initial job has not
>>>>> accepted any resources; check your cluster UI to ensure that workers are
>>>>> registered and have sufficient resources
>>>>> 16/02/29 18:21:27 WARN TaskSchedulerImpl: Initial job has not accepted
>>>>> any resources; check your cluster UI to ensure that workers are registered
>>>>> and have sufficient resources
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 29, 2016 at 12:04 PM, Tim Chen  wrote:
>>>>>
>>>>>> No you don't have to run Mesos in docker containers to run Spark in
>>>>>> docker containers.
>>>>>>
>>>>>> Once you have Mesos cluster running you can then specfiy the Spark
>>>>>> configurations in your Spark job (i.e: 
>>>>>> spark.mesos.executor.docker.image=mesosphere/spark:1.6)
>>>>>> and Mesos will automatically launch docker containers for you.
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni 
>>>>>> wrote:
>>>>>>
>>>>>>> Yes i read that and not much details here.
>>>>>>>
>>>>>>> Is it true that we need to have spark installed on each mesos docker
>>>>>>> container ( master and slave ) ...
>>>>>>>
>>>>>>> Ashish
>>>>>>>
>>>>>>> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen  wrote:
>>>>>>>
>>>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html should
>>>>>>>> be the best source, what problems were you running into?
>>>>>>>>
>>>>>>>> Tim
>>>>>>>>
>>>>>>>> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Have you read this ?
>>>>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>>>>>>>>
>>>>>>>>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni <
>>>>>>>>> asoni.le...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All ,
>>>>>>>>>>
>>>>>>>>>> Is there any proper documentation as how to run spark on mesos ,
>>>>>>>>>> I am trying from the last few days and not able to make it work.
>>>>>>>>>>
>>>>>>>>>> Please help
>>>>>>>>>>
>>>>>>>>>> Ashish
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>


Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
I made some progress but now I am stuck at this point. Please help, as it
looks like I am close to getting it working.

I have everything running in Docker containers, including the Mesos slave and
master.

When I try to submit the Pi example I get the error below:
*Error: Cannot load main class from JAR file:/opt/spark/Example*

Below is the command I use to submit it as a Docker container:

docker run -it --rm -e SPARK_MASTER="mesos://10.0.2.15:7077"  -e
SPARK_IMAGE="spark_driver:latest" spark_driver:latest ./bin/spark-submit
--deploy-mode cluster --name "PI Example" --class
org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory
512m --executor-cores 1
http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar


On Tue, Mar 1, 2016 at 2:59 PM, Timothy Chen  wrote:

> Can you go through the Mesos UI and look at the driver/executor log from
> steer file and see what the problem is?
>
> Tim
>
> On Mar 1, 2016, at 8:05 AM, Ashish Soni  wrote:
>
> Not sure what is the issue but i am getting below error  when i try to run
> spark PI example
>
> Blacklisting Mesos slave value: "5345asdasdasdkas234234asdasdasdasd"
>due to too many failures; is Spark installed on it?
> WARN TaskSchedulerImpl: Initial job has not accepted any resources; check 
> your cluster UI to ensure that workers are registered and have sufficient 
> resources
>
>
> On Mon, Feb 29, 2016 at 1:39 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
>> May be the Mesos executor couldn't find spark image or the constraints
>> are not satisfied. Check your Mesos UI if you see Spark application in the
>> Frameworks tab
>>
>> On Mon, Feb 29, 2016 at 12:23 PM Ashish Soni 
>> wrote:
>>
>>> What is the Best practice , I have everything running as docker
>>> container in single host ( mesos and marathon also as docker container )
>>>  and everything comes up fine but when i try to launch the spark shell i
>>> get below error
>>>
>>>
>>> SQL context available as sqlContext.
>>>
>>> scala> val data = sc.parallelize(1 to 100)
>>> data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
>>> parallelize at :27
>>>
>>> scala> data.count
>>> [Stage 0:>  (0 +
>>> 0) / 2]16/02/29 18:21:12 WARN TaskSchedulerImpl: Initial job has not
>>> accepted any resources; check your cluster UI to ensure that workers are
>>> registered and have sufficient resources
>>> 16/02/29 18:21:27 WARN TaskSchedulerImpl: Initial job has not accepted
>>> any resources; check your cluster UI to ensure that workers are registered
>>> and have sufficient resources
>>>
>>>
>>>
>>> On Mon, Feb 29, 2016 at 12:04 PM, Tim Chen  wrote:
>>>
>>>> No you don't have to run Mesos in docker containers to run Spark in
>>>> docker containers.
>>>>
>>>> Once you have Mesos cluster running you can then specfiy the Spark
>>>> configurations in your Spark job (i.e: 
>>>> spark.mesos.executor.docker.image=mesosphere/spark:1.6)
>>>> and Mesos will automatically launch docker containers for you.
>>>>
>>>> Tim
>>>>
>>>> On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni 
>>>> wrote:
>>>>
>>>>> Yes i read that and not much details here.
>>>>>
>>>>> Is it true that we need to have spark installed on each mesos docker
>>>>> container ( master and slave ) ...
>>>>>
>>>>> Ashish
>>>>>
>>>>> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen  wrote:
>>>>>
>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html should be
>>>>>> the best source, what problems were you running into?
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang 
>>>>>> wrote:
>>>>>>
>>>>>>> Have you read this ?
>>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>>>>>>
>>>>>>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni >>>>>> > wrote:
>>>>>>>
>>>>>>>> Hi All ,
>>>>>>>>
>>>>>>>> Is there any proper documentation as how to run spark on mesos , I
>>>>>>>> am trying from the last few days and not able to make it work.
>>>>>>>>
>>>>>>>> Please help
>>>>>>>>
>>>>>>>> Ashish
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>


Spark Submit - Convert to Marathon REST API

2016-03-01 Thread Ashish Soni
Hi All ,

Can someone please help me translate the spark-submit command below into a
Marathon JSON request?

docker run -it --rm -e SPARK_MASTER="mesos://10.0.2.15:5050"  -e
SPARK_IMAGE="spark_driver:latest" spark_driver:latest
/opt/spark/bin/spark-submit  --name "PI Example" --class
org.apache.spark.examples.SparkPi --driver-memory 1g --executor-memory 1g
--executor-cores 1 /opt/spark/lib/spark-examples-1.6.0-hadoop2.6.0.jar
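
A rough, untested sketch of what an equivalent Marathon app definition might look
like (the cpus/mem numbers are examples; the image, environment variables and
spark-submit line are taken from the docker run command above):

{
  "id": "/spark-pi-example",
  "cmd": "/opt/spark/bin/spark-submit --name 'PI Example' --class org.apache.spark.examples.SparkPi --driver-memory 1g --executor-memory 1g --executor-cores 1 /opt/spark/lib/spark-examples-1.6.0-hadoop2.6.0.jar",
  "cpus": 1,
  "mem": 2048,
  "instances": 1,
  "env": {
    "SPARK_MASTER": "mesos://10.0.2.15:5050",
    "SPARK_IMAGE": "spark_driver:latest"
  },
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "spark_driver:latest",
      "network": "HOST"
    }
  }
}

Such a definition would be POSTed to Marathon's /v2/apps endpoint.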

Thanks,


Re: Spark 1.5 on Mesos

2016-03-01 Thread Ashish Soni
Not sure what the issue is, but I am getting the error below when I try to run
the Spark Pi example.

Blacklisting Mesos slave value: "5345asdasdasdkas234234asdasdasdasd"
   due to too many failures; is Spark installed on it?
WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered
and have sufficient resources


On Mon, Feb 29, 2016 at 1:39 PM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> May be the Mesos executor couldn't find spark image or the constraints are
> not satisfied. Check your Mesos UI if you see Spark application in the
> Frameworks tab
>
> On Mon, Feb 29, 2016 at 12:23 PM Ashish Soni 
> wrote:
>
>> What is the Best practice , I have everything running as docker container
>> in single host ( mesos and marathon also as docker container )  and
>> everything comes up fine but when i try to launch the spark shell i get
>> below error
>>
>>
>> SQL context available as sqlContext.
>>
>> scala> val data = sc.parallelize(1 to 100)
>> data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
>> parallelize at :27
>>
>> scala> data.count
>> [Stage 0:>  (0 +
>> 0) / 2]16/02/29 18:21:12 WARN TaskSchedulerImpl: Initial job has not
>> accepted any resources; check your cluster UI to ensure that workers are
>> registered and have sufficient resources
>> 16/02/29 18:21:27 WARN TaskSchedulerImpl: Initial job has not accepted
>> any resources; check your cluster UI to ensure that workers are registered
>> and have sufficient resources
>>
>>
>>
>> On Mon, Feb 29, 2016 at 12:04 PM, Tim Chen  wrote:
>>
>>> No you don't have to run Mesos in docker containers to run Spark in
>>> docker containers.
>>>
>>> Once you have Mesos cluster running you can then specfiy the Spark
>>> configurations in your Spark job (i.e: 
>>> spark.mesos.executor.docker.image=mesosphere/spark:1.6)
>>> and Mesos will automatically launch docker containers for you.
>>>
>>> Tim
>>>
>>> On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni 
>>> wrote:
>>>
>>>> Yes i read that and not much details here.
>>>>
>>>> Is it true that we need to have spark installed on each mesos docker
>>>> container ( master and slave ) ...
>>>>
>>>> Ashish
>>>>
>>>> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen  wrote:
>>>>
>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html should be
>>>>> the best source, what problems were you running into?
>>>>>
>>>>> Tim
>>>>>
>>>>> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang  wrote:
>>>>>
>>>>>> Have you read this ?
>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>>>>>
>>>>>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All ,
>>>>>>>
>>>>>>> Is there any proper documentation as how to run spark on mesos , I
>>>>>>> am trying from the last few days and not able to make it work.
>>>>>>>
>>>>>>> Please help
>>>>>>>
>>>>>>> Ashish
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>


Re: Spark 1.5 on Mesos

2016-02-29 Thread Ashish Soni
What is the best practice? I have everything running as Docker containers
on a single host (Mesos and Marathon also as Docker containers), and
everything comes up fine, but when I try to launch the spark-shell I get the
error below.


SQL context available as sqlContext.

scala> val data = sc.parallelize(1 to 100)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
parallelize at :27

scala> data.count
[Stage 0:>  (0 + 0)
/ 2]16/02/29 18:21:12 WARN TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered
and have sufficient resources
16/02/29 18:21:27 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient resources



On Mon, Feb 29, 2016 at 12:04 PM, Tim Chen  wrote:

> No you don't have to run Mesos in docker containers to run Spark in docker
> containers.
>
> Once you have Mesos cluster running you can then specfiy the Spark
> configurations in your Spark job (i.e: 
> spark.mesos.executor.docker.image=mesosphere/spark:1.6)
> and Mesos will automatically launch docker containers for you.
>
> Tim
>
> On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni 
> wrote:
>
>> Yes i read that and not much details here.
>>
>> Is it true that we need to have spark installed on each mesos docker
>> container ( master and slave ) ...
>>
>> Ashish
>>
>> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen  wrote:
>>
>>> https://spark.apache.org/docs/latest/running-on-mesos.html should be
>>> the best source, what problems were you running into?
>>>
>>> Tim
>>>
>>> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang  wrote:
>>>
>>>> Have you read this ?
>>>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>>>
>>>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni 
>>>> wrote:
>>>>
>>>>> Hi All ,
>>>>>
>>>>> Is there any proper documentation as how to run spark on mesos , I am
>>>>> trying from the last few days and not able to make it work.
>>>>>
>>>>> Please help
>>>>>
>>>>> Ashish
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Spark 1.5 on Mesos

2016-02-29 Thread Ashish Soni
Yes, I read that, but there are not many details there.

Is it true that we need to have Spark installed on each Mesos Docker
container (master and slave)?

Ashish

On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen  wrote:

> https://spark.apache.org/docs/latest/running-on-mesos.html should be the
> best source, what problems were you running into?
>
> Tim
>
> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang  wrote:
>
>> Have you read this ?
>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>
>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni 
>> wrote:
>>
>>> Hi All ,
>>>
>>> Is there any proper documentation as how to run spark on mesos , I am
>>> trying from the last few days and not able to make it work.
>>>
>>> Please help
>>>
>>> Ashish
>>>
>>
>>
>


Spark 1.5 on Mesos

2016-02-26 Thread Ashish Soni
Hi All ,

Is there any proper documentation on how to run Spark on Mesos? I have been
trying for the last few days and am not able to make it work.

Please help

Ashish


Communication between two spark streaming Job

2016-02-19 Thread Ashish Soni
Hi ,

Is there any way we can communicate across two different Spark Streaming
jobs? The scenario is below.

We have two Spark Streaming jobs: one to process the metadata and one to process
the actual data (which needs the metadata).

So if someone updates the metadata, we need to update the cache
maintained in the second job so that it can make use of the new metadata.

Please help

Ashish
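
One workaround often used instead of direct job-to-job communication (a sketch, not
an established Spark feature): have the metadata job write its updates to an
external store (a database, or a compacted Kafka topic), and have the data job
refresh a cached copy every so often. A minimal Java sketch of such a cache; the
Supplier stands in for whatever reads the metadata from that store:

import java.util.function.Supplier;

// Minimal sketch: a lazily refreshed cache the data-processing job can consult
// on each batch instead of being notified directly by the metadata job.
public final class RefreshableCache<T> {
    private final Supplier<T> loader;   // e.g. reads the metadata from a DB or Kafka
    private final long ttlMs;           // how stale the cached copy is allowed to get
    private volatile T cached;
    private volatile long lastRefreshMs;

    public RefreshableCache(Supplier<T> loader, long ttlMs) {
        this.loader = loader;
        this.ttlMs = ttlMs;
    }

    public synchronized T get() {
        long now = System.currentTimeMillis();
        if (cached == null || now - lastRefreshMs > ttlMs) {
            cached = loader.get();      // pull the latest metadata
            lastRefreshMs = now;
        }
        return cached;
    }
}

Calling get() from the second job's per-batch code keeps its metadata at most
ttlMs old without restarting the job.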


SPARK-9559

2016-02-18 Thread Ashish Soni
Hi All ,

Just wanted to know if there is any workaround or resolution for the issue
below in standalone mode:

https://issues.apache.org/jira/browse/SPARK-9559

Ashish


Separate Log4j.xml for Spark and Application JAR ( Application vs Spark )

2016-02-12 Thread Ashish Soni
Hi All ,

As per my best understanding, we can have only one log4j configuration for both
Spark and the application, as whichever comes first on the classpath takes
precedence. Is there any way we can keep one in the application and one in the
Spark conf folder? Is that possible?
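
Since whichever file is found first wins, one way to make the choice explicit
instead of relying on classpath order is to point the Spark processes at a file
directly (a sketch; the paths, class and jar names are examples):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/opt/spark/conf/log4j.xml" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/opt/spark/conf/log4j.xml" \
  --class com.example.MyApp myapp.jar

The application JAR can then keep its own log4j.xml for runs outside spark-submit.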

Thanks


Re: Spark Submit

2016-02-12 Thread Ashish Soni
It works as below:

spark-submit --conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml" --conf
spark.executor.memory=512m

Thanks all for the quick help.



On Fri, Feb 12, 2016 at 10:59 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:

> Try
> spark-submit  --conf "spark.executor.memory=512m" --conf
> "spark.executor.extraJavaOptions=x" --conf "Dlog4j.configuration=log4j.xml"
>
> Sent from Samsung Mobile.
>
>
>  Original message ----
> From: Ted Yu 
> Date:12/02/2016 21:24 (GMT+05:30)
> To: Ashish Soni 
> Cc: user 
> Subject: Re: Spark Submit
>
> Have you tried specifying multiple '--conf key=value' ?
>
> Cheers
>
> On Fri, Feb 12, 2016 at 7:44 AM, Ashish Soni 
> wrote:
>
>> Hi All ,
>>
>> How do i pass multiple configuration parameter while spark submit
>>
>> Please help i am trying as below
>>
>> spark-submit  --conf "spark.executor.memory=512m
>> spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml"
>>
>> Thanks,
>>
>
>


Spark Submit

2016-02-12 Thread Ashish Soni
Hi All ,

How do I pass multiple configuration parameters with spark-submit?

Please help; I am trying as below:

spark-submit  --conf "spark.executor.memory=512m
spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml"

Thanks,


Example of onEnvironmentUpdate Listener

2016-02-08 Thread Ashish Soni
Are there any examples of how to implement the onEnvironmentUpdate method for a
custom listener?
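
A minimal sketch of one (assuming a Spark 1.x release where
org.apache.spark.JavaSparkListener is available and supplies no-op defaults for
the other callbacks; it can be registered with sparkContext.sc().addSparkListener(...)):

import org.apache.spark.JavaSparkListener;
import org.apache.spark.scheduler.SparkListenerEnvironmentUpdate;

// Sketch of a custom listener that only reacts to environment updates.
public class EnvironmentUpdateListener extends JavaSparkListener {
    @Override
    public void onEnvironmentUpdate(SparkListenerEnvironmentUpdate update) {
        // environmentDetails() groups entries by section, e.g. "Spark Properties",
        // "System Properties", "Classpath Entries".
        System.out.println("Environment updated, sections: "
                + update.environmentDetails().keySet());
    }
}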

Thanks,


Dynamically Change Log Level Spark Streaming

2016-02-08 Thread Ashish Soni
Hi All ,

How do I change the log level for a running Spark Streaming job? Any help
will be appreciated.

Thanks,


Redirect Spark Logs to Kafka

2016-02-01 Thread Ashish Soni
Hi All ,

Please let me know how we can redirect Spark log files, or tell Spark to
log to a Kafka queue instead of files.

Ashish


Re: Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Ashish Soni
Correct. What I am trying to achieve is that before the streaming job starts,
I query the topic metadata from Kafka, determine all the partitions, and
provide those to the direct API.

So my question is: should I consider passing all the partitions from the command
line, or query Kafka and find and provide them? What is the correct approach?

Ashish
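
For reference, a minimal Java sketch of the simpler direct-stream constructor that
lets Spark discover the topic's partitions itself (the broker list and topic name
are placeholders, and jssc is assumed to be an existing JavaStreamingContext):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import kafka.serializer.StringDecoder;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholder
Set<String> topics = new HashSet<String>(Arrays.asList("input-topic")); // placeholder

JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc,
        String.class, String.class,
        StringDecoder.class, StringDecoder.class,
        kafkaParams, topics);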

On Mon, Jan 25, 2016 at 11:38 AM, Gerard Maas  wrote:

> What are you trying to achieve?
>
> Looks like you want to provide offsets but you're not managing them
> and I'm assuming you're using the direct stream approach.
>
> In that case, use the simpler constructor that takes the kafka config and
> the topics. Let it figure it out the offsets (it will contact kafka and
> request the partitions for the topics provided)
>
> KafkaUtils.createDirectStream[...](ssc, kafkaConfig, topics)
>
>  -kr, Gerard
>
> On Mon, Jan 25, 2016 at 5:31 PM, Ashish Soni 
> wrote:
>
>> Hi All ,
>>
>> What is the best way to tell spark streaming job for the no of partition
>> to to a given topic -
>>
>> Should that be provided as a parameter or command line argument
>> or
>> We should connect to kafka in the driver program and query it
>>
>> Map<TopicAndPartition, Long> fromOffsets = new HashMap<TopicAndPartition, Long>();
>> fromOffsets.put(new TopicAndPartition(driverArgs.inputTopic, 0), 0L);
>>
>> Thanks,
>> Ashish
>>
>
>


Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Ashish Soni
Hi All ,

What is the best way to tell a Spark Streaming job the number of partitions for
a given topic?

Should that be provided as a parameter or command-line argument,
or
should we connect to Kafka in the driver program and query it?

Map<TopicAndPartition, Long> fromOffsets = new HashMap<TopicAndPartition, Long>();
fromOffsets.put(new TopicAndPartition(driverArgs.inputTopic, 0), 0L);

Thanks,
Ashish


How to change the no of cores assigned for a Submitted Job

2016-01-12 Thread Ashish Soni
Hi ,

I am seeing strange behavior when creating a standalone Spark container using
Docker.
Not sure why, but by default it assigns 4 cores to the first job submitted,
and then all the other jobs are in a wait state. Please suggest if there is
a setting to change this.

I tried --executor-cores 1 but it has no effect.

[image: Inline image 1]
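
In standalone mode an application takes every available core unless it is capped,
so later jobs wait; one setting worth trying (a sketch, the class and jar names
are placeholders) is to limit the total cores per application rather than the
cores per executor:

spark-submit --master spark://<master-host>:7077 \
  --total-executor-cores 1 \
  --class com.example.MyJob myjob.jar

The same cap can also be set with --conf spark.cores.max=1.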


Deployment and performance related queries for Spark and Cassandra

2015-12-21 Thread Ashish Gadkari
Hi

We have configured total* 11 nodes*. Each node contains 8 cores and 32 GB
RAM

*Technologies and their version:*
Apache Spark 1.5.2 and YARN  : 6 nodes
DSE 4.7 [Cassandra 2.1.8 and Solr]  : 5 nodes
HDFS (Hadoop version 2.7.1): 3 nodes

*Stack:*
3 separate nodes for HDFS
3 separate nodes for Spark + YARN
2 separate seed nodes for DSE Cassandra
3 nodes share Cassandra and Spark both

HDFS and Cassandra Replication factor : 3
Used DSE Solr for indexing records in cassandra.
Programming code is in Java.


*Job flow:*

   1. Driver program to initialize spark and cassandra with 2 seed nodes
   2. Fetch json file from HDFS.
   3. mappartitions on files and using FlatMap function to iterate over data
   4. Each line from file represents a record. In FlatMap function, We use
   gson to convert json to POJO
   5. Invoke solr HTTP GET request based on the fields of POJO. We invoke
   roughly 10 HTTP requests per POJO constructed in previous step. HTTP
   request have any one of 5 Cassandra IPs for distributing GET request load
   across nodes.
   6. These POJOs are collected in an arraylist and returned to driver
   7. We then invoke the mapToRow function to insert these RDDs into
   cassandra.


*Queries:*

   1. Deployment- From the deployment standpoint, does the technology stack
   on each node make sense?
   2. How to determine the partitions size. We are currently using formula
   => size in MB / 16. Should we determine the number of cores, executors and
   memory based on data size or number of rows in the file.
   3. TableWriter issue - While writing RDDs into cassandra, computation
   processes halt and take more time to complete. We are using YJP-profiler
   for monitoring these stats. How do we overcome this latency?
   4. Are there any performance related parameters in Spark, Cassandra,
   Solr which will reduce the job time


Any help to increase the performance will be appreciated.
Thanks


-- 
Ashish Gadkari


Discover SparkUI port for spark streaming job running in cluster mode

2015-12-14 Thread Ashish Nigam
Hi,
I run a Spark Streaming job in cluster mode. This means the driver can run
on any data node, and the Spark UI can run on any dynamic port.
At present, I know about the port by looking at container logs that look
something like this -

server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:50571
INFO util.Utils: Successfully started service 'SparkUI' on port 50571.
INFO ui.SparkUI: Started SparkUI at http://xxx:50571


Is there any way to know about the UI port automatically using some API?

Thanks
Ashish


Re: Save GraphX to disk

2015-11-20 Thread Ashish Rawat
Hi Todd,

Could you please provide an example of doing this? Mazerunner seems to be doing 
something similar with Neo4j but it goes via hdfs and updates only the graph 
properties. Is there a direct way to do this with Neo4j or Titan?

Regards,
Ashish

From: SLiZn Liu 
Date: Saturday, 14 November 2015 7:44 am
To: Gaurav Kumar , user@spark.apache.org
Subject: Re: Save GraphX to disk

Hi Gaurav,

Your graph can be saved to graph databases like Neo4j or Titan through their 
drivers, that eventually saved to the disk.

BR,
Todd

Gaurav Kumar
gauravkuma...@gmail.com wrote on Friday, 13 November 2015 at 22:08:
Hi,

I was wondering how to save a graph to disk and load it back again. I know how 
to save vertices and edges to disk and construct the graph from them, not sure 
if there's any method to save the graph itself to disk.

Best Regards,
Gaurav Kumar
Big Data * Data Science * Photography * Music
+91 9953294125


Re: Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Ashish Shrowty
Thanks Steve. I built it from source.


On Thu, Oct 22, 2015 at 4:01 PM Steve Loughran 
wrote:

>
> > On 22 Oct 2015, at 15:12, Ashish Shrowty 
> wrote:
> >
> > I understand that there is some incompatibility with the API between
> Hadoop
> > 2.6/2.7 and Amazon AWS SDK where they changed a signature of
> >
> com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold.
> > The JIRA indicates that this would be fixed in Hadoop 2.8.
> > (https://issues.apache.org/jira/browse/HADOOP-12420)
> >
> > My question is - what are people doing today to access S3? I am unable to
> > find an older JAR of the AWS SDK to test with.
>
> its on maven
>
> http://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4
>
> >
> > Thanks,
> > Ashish
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-5-1-Hadoop2-6-unable-to-write-to-S3-HADOOP-12420-tp25163.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
>
>


Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Ashish Shrowty
I understand that there is some incompatibility with the API between Hadoop
2.6/2.7 and Amazon AWS SDK where they changed a signature of 
com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold.
The JIRA indicates that this would be fixed in Hadoop 2.8.
(https://issues.apache.org/jira/browse/HADOOP-12420)

My question is - what are people doing today to access S3? I am unable to
find an older JAR of the AWS SDK to test with.

Thanks,
Ashish



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-5-1-Hadoop2-6-unable-to-write-to-S3-HADOOP-12420-tp25163.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: question on make multiple external calls within each partition

2015-10-05 Thread Ashish Soni
Need more details, but you might want to filter the data first (create multiple
RDDs) and then process them.
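
If the per-record calls also need to overlap inside a partition, a rough Java
sketch could look like the helper below (a plain thread pool; callService stands
in for the real REST client), invoked from foreachPartition or mapPartitions:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Rough sketch: drain one partition's records through a local thread pool so the
// per-record service calls overlap instead of running strictly sequentially.
public final class PartitionFanOut {
    public static <T> void process(Iterator<T> records,
                                   Consumer<T> callService,   // stand-in for the REST call
                                   int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<?>> pending = new ArrayList<Future<?>>();
            while (records.hasNext()) {
                final T record = records.next();
                pending.add(pool.submit(() -> callService.accept(record)));
            }
            for (Future<?> f : pending) {
                f.get();                // surface any failure from the calls
            }
        } finally {
            pool.shutdown();
        }
    }
}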


> On Oct 5, 2015, at 8:35 PM, Chen Song  wrote:
> 
> We have a use case with the following design in Spark Streaming.
> 
> Within each batch,
> * data is read and partitioned by some key
> * forEachPartition is used to process the entire partition
> * within each partition, there are several REST clients created to connect to 
> different REST services
> * for the list of records within each partition, it will call these services, 
> each service call is independent of others; records are just pre-partitioned 
> to make these calls more efficiently.
> 
> I have a question
> * Since each call is time taking and to prevent the calls to be executed 
> sequentially, how can I parallelize the service calls within processing of 
> each partition? Can I just use Scala future within forEachPartition(or 
> mapPartitions)?
> 
> Any suggestions greatly appreciated.
> 
> Chen
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: DStream Transformation to save JSON in Cassandra 2.1

2015-10-05 Thread Ashish Soni
Try this:


You can use dstream.map to convert it to a JavaDStream containing only the data you
are interested in (probably returning a POJO of your JSON),

and then call foreachRDD, and inside that call the line below (keyspace first,
then table):

javaFunctions(rdd).writerBuilder("keyspace", "table",
mapToRow(Class.class)).saveToCassandra();
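
Put together, a rough sketch of that save step (assuming a Coordinate JavaBean
whose fields match the table columns, a JavaDStream<Coordinate> named coords
produced by the map step, and Spark 1.x's Function<..., Void> form of foreachRDD):

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

coords.foreachRDD(new Function<JavaRDD<Coordinate>, Void>() {
    @Override
    public Void call(JavaRDD<Coordinate> rdd) {
        javaFunctions(rdd)
            .writerBuilder("iotdata", "coordinate", mapToRow(Coordinate.class))
            .saveToCassandra();
        return null;
    }
});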

On Mon, Oct 5, 2015 at 10:14 AM, Prateek .  wrote:

> Hi,
>
> I am beginner in Spark , this is sample data I get from Kafka stream:
>
> {"id":
> "9f5ccb3d5f4f421392fb98978a6b368f","coordinate":{"ax":"1.20","ay":"3.80","az":"9.90","oa":"8.03","ob":"8.8","og":"9.97"}}
>
>   val lines = KafkaUtils.createStream(ssc, zkQuorum, group,
> topicMap).map(_._2)
>   val jsonf =
> lines.map(JSON.parseFull(_)).map(_.get.asInstanceOf[scala.collection.immutable.Map[String,
> Any]])
>
>   I am getting a, DSTream[Map[String,Any]]. I need to store each
> coordinate values in the below Cassandra schema
>
> CREATE TABLE iotdata.coordinate (
> id text PRIMARY KEY, ax double, ay double, az double, oa double, ob
> double, oz double
> )
>
> For this what transformations I need to apply before I execute
> saveToCassandra().
>
> Thank You,
> Prateek
>
>
> "DISCLAIMER: This message is proprietary to Aricent and is intended solely
> for the use of the individual to whom it is addressed. It may contain
> privileged or confidential information and should not be circulated or used
> for any purpose other than for what it is intended. If you have received
> this message in error, please notify the originator immediately. If you are
> not the intended recipient, you are notified that you are strictly
> prohibited from using, copying, altering, or disclosing the contents of
> this message. Aricent accepts no responsibility for loss or damage arising
> from the use of the information transmitted by this email including damage
> from virus."
>


Re: automatic start of streaming job on failure on YARN

2015-10-02 Thread Ashish Rangole
Are you running the job in yarn cluster mode?
On Oct 1, 2015 6:30 AM, "Jeetendra Gangele"  wrote:

> We've a streaming application running on yarn and we would like to ensure
> that is up running 24/7.
>
> Is there a way to tell yarn to automatically restart a specific
> application on failure?
>
> There is property yarn.resourcemanager.am.max-attempts which is default
> set to 2 setting it to bigger value is the solution? Also I did observed
> this does not seems to work my application is failing and not starting
> automatically.
>
> Mesos has this build in support wondering why yarn is lacking here?
>
>
>
> Regards
>
> jeetendra
>


Re: Spark Streaming Log4j Inside Eclipse

2015-09-29 Thread Ashish Soni
I am using the Java streaming context, and it doesn't have a setLogLevel method; I
have also tried passing the VM argument in Eclipse and it doesn't work.

JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.seconds(2));

Ashish
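
One possible workaround (assuming Spark 1.4 or later, where JavaSparkContext
exposes setLogLevel) is to go through the streaming context's underlying
JavaSparkContext:

// Sketch: raise the log threshold via the underlying JavaSparkContext.
jssc.sparkContext().setLogLevel("WARN");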

On Tue, Sep 29, 2015 at 7:23 AM, Adrian Tanase  wrote:

> You should set exta java options for your app via Eclipse project and
> specify something like
>
>  -Dlog4j.configuration=file:/tmp/log4j.properties
>
> Sent from my iPhone
>
> On 28 Sep 2015, at 18:52, Shixiong Zhu  wrote:
>
> You can use JavaSparkContext.setLogLevel to set the log level in your
> codes.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-28 22:55 GMT+08:00 Ashish Soni :
>
>> I am not running it using spark submit , i am running locally inside
>> Eclipse IDE , how i set this using JAVA Code
>>
>> Ashish
>>
>> On Mon, Sep 28, 2015 at 10:42 AM, Adrian Tanase 
>> wrote:
>>
>>> You also need to provide it as parameter to spark submit
>>>
>>> http://stackoverflow.com/questions/28840438/how-to-override-sparks-log4j-properties-per-driver
>>>
>>> From: Ashish Soni
>>> Date: Monday, September 28, 2015 at 5:18 PM
>>> To: user
>>> Subject: Spark Streaming Log4j Inside Eclipse
>>>
>>> I need to turn off the verbose logging of Spark Streaming Code when i am
>>> running inside eclipse i tried creating a log4j.properties file and placed
>>> inside /src/main/resources but i do not see it getting any effect , Please
>>> help as not sure what else needs to be done to change the log at DEBUG or
>>> WARN
>>>
>>
>>
>


Re: Spark Streaming Log4j Inside Eclipse

2015-09-28 Thread Ashish Soni
I am not running it using spark-submit; I am running it locally inside the
Eclipse IDE. How do I set this using Java code?

Ashish

On Mon, Sep 28, 2015 at 10:42 AM, Adrian Tanase  wrote:

> You also need to provide it as parameter to spark submit
>
> http://stackoverflow.com/questions/28840438/how-to-override-sparks-log4j-properties-per-driver
>
> From: Ashish Soni
> Date: Monday, September 28, 2015 at 5:18 PM
> To: user
> Subject: Spark Streaming Log4j Inside Eclipse
>
> I need to turn off the verbose logging of Spark Streaming Code when i am
> running inside eclipse i tried creating a log4j.properties file and placed
> inside /src/main/resources but i do not see it getting any effect , Please
> help as not sure what else needs to be done to change the log at DEBUG or
> WARN
>


Spark Streaming Log4j Inside Eclipse

2015-09-28 Thread Ashish Soni
Hi All ,

I need to turn off the verbose logging of my Spark Streaming code when I am
running inside Eclipse. I tried creating a log4j.properties file and placing it
under /src/main/resources, but I do not see it taking any effect. Please help,
as I am not sure what else needs to be done to change the log level to DEBUG or
WARN.

Ashish


Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Ashish Soni
Hi All ,

Just wanted to find out if there are any benefits to installing Kafka
brokers and Spark nodes on the same machine.

Is it possible for Spark to pull data from Kafka when it is local to the
node, i.e. the broker or partition is on the same machine?

Thanks,
Ashish


Spark Cassandra Filtering

2015-09-16 Thread Ashish Soni
Hi ,

How can I pass a dynamic value to the function below to filter on, instead of a
hardcoded one?
I have an existing RDD and I would like to use data from it for the filter, so
instead of doing .where("name=?", "Anna") I want to do
.where("name=?", someobject.value).

Please help

JavaRDD<String> rdd3 = javaFunctions(sc).cassandraTable("test", "people",
        mapRowTo(Person.class))
    .where("name=?", "Anna").map(new Function<Person, String>() {
        @Override
        public String call(Person person) throws Exception {
            return person.toString();
        }
    });
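
A runtime value can be passed the same way as the literal; a minimal sketch
(someobject.value is a placeholder for wherever the dynamic value comes from):

String name = someobject.value;   // the dynamic value
JavaRDD<String> rdd3 = javaFunctions(sc)
        .cassandraTable("test", "people", mapRowTo(Person.class))
        .where("name=?", name)
        .map(new Function<Person, String>() {
            @Override
            public String call(Person person) throws Exception {
                return person.toString();
            }
        });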


Dynamic Workflow Execution using Spark

2015-09-15 Thread Ashish Soni
Hi All ,

Are there any frameworks which can be used to execute workflows within
Spark? Or is it possible to use an ML Pipeline for workflow execution without
actually doing ML?

Thanks,
Ashish


Re: ArrayIndexOutOfBoundsException when using repartitionAndSortWithinPartitions()

2015-09-10 Thread Ashish Shenoy
Yup thanks Ted. My getPartition() method had a bug where a signed int was
being moduloed with the number of partitions. Fixed that.

Thanks,
Ashish
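
For reference, a sketch of the kind of fix described above, keeping getPartition()
non-negative for signed keys (not necessarily the exact code used):

@Override
public int getPartition(Object o) {
    NewKey newKey = (NewKey) o;
    long raw = newKey.getGsMinusURL() % numPartitions;      // may be negative
    return (int) ((raw + numPartitions) % numPartitions);   // shift into [0, numPartitions)
}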

On Thu, Sep 10, 2015 at 10:44 AM, Ted Yu  wrote:

> Here is snippet of ExternalSorter.scala where ArrayIndexOutOfBoundsException
> was thrown:
>
> while (iterator.hasNext) {
>   val partitionId = iterator.nextPartition()
>   iterator.writeNext(partitionWriters(partitionId))
> }
> Meaning, partitionId was negative.
> Execute the following and examine the value of i:
>
> int i = -78 % 40;
>
> You will see how your getPartition() method should be refined to prevent
> this exception.
>
> On Thu, Sep 10, 2015 at 8:52 AM, Ashish Shenoy 
> wrote:
>
>> I am using spark-1.4.1
>>
>> Here's the skeleton code:
>>
>> JavaPairRDD rddPair =
>>   rdd.repartitionAndSortWithinPartitions(
>>   new CustomPartitioner(), new ExportObjectComparator())
>> .persist(StorageLevel.MEMORY_AND_DISK_SER());
>>
>> ...
>>
>> @SuppressWarnings("serial")
>> private static class CustomPartitioner extends Partitioner {
>>   int numPartitions;
>>   @Override
>>   public int numPartitions() {
>> numPartitions = 40;
>> return numPartitions;
>>   }
>>
>>   @Override
>>   public int getPartition(Object o) {
>> NewKey newKey = (NewKey) o;
>> return (int) newKey.getGsMinusURL() % numPartitions;
>>   }
>> }
>>
>> ...
>>
>> @SuppressWarnings("serial")
>> private static class ExportObjectComparator
>>   implements Serializable, Comparator {
>>   @Override
>>   public int compare(NewKey o1, NewKey o2) {
>> if (o1.hits == o2.hits) {
>>   return 0;
>> } else if (o1.hits > o2.hits) {
>>   return -1;
>> } else {
>>   return 1;
>> }
>>   }
>>
>> }
>>
>> ...
>>
>>
>>
>> Thanks,
>> Ashish
>>
>> On Wed, Sep 9, 2015 at 5:13 PM, Ted Yu  wrote:
>>
>>> Which release of Spark are you using ?
>>>
>>> Can you show skeleton of your partitioner and comparator ?
>>>
>>> Thanks
>>>
>>>
>>>
>>> On Sep 9, 2015, at 4:45 PM, Ashish Shenoy 
>>> wrote:
>>>
>>> Hi,
>>>
>>> I am trying to sort a RDD pair
>>> using repartitionAndSortWithinPartitions() for my key [which is a custom
>>> class, not a java primitive] using a custom partitioner on that key and a
>>> custom comparator. However, it fails consistently:
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 18 in stage 1.0 failed 4 times, most recent failure: Lost task 18.3 in
>>> stage 1.0 (TID 202, 172.16.18.25):
>>> java.lang.ArrayIndexOutOfBoundsException: -78
>>> at
>>> org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
>>> at
>>> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
>>> at
>>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Driver stacktrace:
>>> at org.apache.spark.scheduler.DAGScheduler.org
>>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at
>>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.abortSta

Re: ArrayIndexOutOfBoundsException when using repartitionAndSortWithinPartitions()

2015-09-10 Thread Ashish Shenoy
I am using spark-1.4.1

Here's the skeleton code:

JavaPairRDD rddPair =
  rdd.repartitionAndSortWithinPartitions(
  new CustomPartitioner(), new ExportObjectComparator())
.persist(StorageLevel.MEMORY_AND_DISK_SER());

...

@SuppressWarnings("serial")
private static class CustomPartitioner extends Partitioner {
  int numPartitions;
  @Override
  public int numPartitions() {
numPartitions = 40;
return numPartitions;
  }

  @Override
  public int getPartition(Object o) {
NewKey newKey = (NewKey) o;
return (int) newKey.getGsMinusURL() % numPartitions;
  }
}

...

@SuppressWarnings("serial")
private static class ExportObjectComparator
  implements Serializable, Comparator<NewKey> {
  @Override
  public int compare(NewKey o1, NewKey o2) {
if (o1.hits == o2.hits) {
  return 0;
} else if (o1.hits > o2.hits) {
  return -1;
} else {
  return 1;
    }
  }

}

...



Thanks,
Ashish

On Wed, Sep 9, 2015 at 5:13 PM, Ted Yu  wrote:

> Which release of Spark are you using ?
>
> Can you show skeleton of your partitioner and comparator ?
>
> Thanks
>
>
>
> On Sep 9, 2015, at 4:45 PM, Ashish Shenoy 
> wrote:
>
> Hi,
>
> I am trying to sort a RDD pair using repartitionAndSortWithinPartitions()
> for my key [which is a custom class, not a java primitive] using a custom
> partitioner on that key and a custom comparator. However, it fails
> consistently:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 18
> in stage 1.0 failed 4 times, most recent failure: Lost task 18.3 in stage
> 1.0 (TID 202, 172.16.18.25): java.lang.ArrayIndexOutOfBoundsException: -78
> at
> org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
> at
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
> at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> I also persist the RDD using the "memory and disk" storage level. The
> stack trace above comes from spark's code and not my application code. Can
> you pls point out what I am doing wrong ?
>
> Thanks,
> Ashish
>
>


ArrayIndexOutOfBoundsException when using repartitionAndSortWithinPartitions()

2015-09-09 Thread Ashish Shenoy
Hi,

I am trying to sort a RDD pair using repartitionAndSortWithinPartitions()
for my key [which is a custom class, not a java primitive] using a custom
partitioner on that key and a custom comparator. However, it fails
consistently:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 18
in stage 1.0 failed 4 times, most recent failure: Lost task 18.3 in stage
1.0 (TID 202, 172.16.18.25): java.lang.ArrayIndexOutOfBoundsException: -78
at
org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

I also persist the RDD using the "memory and disk" storage level. The stack
trace above comes from spark's code and not my application code. Can you
pls point out what I am doing wrong ?

Thanks,
Ashish


Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-09 Thread Ashish Dutt
Dear Sasha,

What I did was that I installed the parcels on all the nodes of the
cluster. Typically the location was
/opt/cloudera/parcels/CDH5.4.2-1.cdh5.4.2.p0.2
Hope this helps you.

With regards,
Ashish



On Tue, Sep 8, 2015 at 10:18 PM, Sasha Kacanski  wrote:

> Hi Ashish,
> Thanks for the update.
> I tried all of it, but what I don't get it is that I run cluster with one
> node so presumably I should have PYspark binaries there as I am developing
> on same host.
> Could you tell me where you placed parcels or whatever cloudera is using.
> My understanding of yarn and spark is that these binaries get compressed
> and packaged with Java to be pushed to work node.
> Regards,
> On Sep 7, 2015 9:00 PM, "Ashish Dutt"  wrote:
>
>> Hello Sasha,
>>
>> I have no answer for debian. My cluster is on Linux and I'm using CDH 5.4
>> Your question-  "Error from python worker:
>>   /cube/PY/Python27/bin/python: No module named pyspark"
>>
>> On a single node (ie one server/machine/computer) I installed pyspark
>> binaries and it worked. Connected it to pycharm and it worked too.
>>
>> Next I tried executing pyspark command on another node (say the worker)
>> in the cluster and i got this error message, Error from python worker:
>> PATH: No module named pyspark".
>>
>> My first guess was that the worker is not picking up the path of pyspark
>> binaries installed on the server ( I tried many a things like hard-coding
>> the pyspark path in the config.sh file on the worker- NO LUCK; tried
>> dynamic path from the code in pycharm- NO LUCK... ; searched the web and
>> asked the question in almost every online forum--NO LUCK..; banged my head
>> several times with pyspark/hadoop books--NO LUCK... Finally, one fine day a
>> 'watermelon' dropped while brooding on this problem and I installed pyspark
>> binaries on all the worker machines ) Now when I tried executing just the
>> command pyspark on the worker's it worked. Tried some simple program
>> snippets on each worker, it works too.
>>
>> I am not sure if this will help or not for your use-case.
>>
>>
>>
>> Sincerely,
>> Ashish
>>
>> On Mon, Sep 7, 2015 at 11:04 PM, Sasha Kacanski 
>> wrote:
>>
>>> Thanks Ashish,
>>> nice blog but does not cover my issue. Actually I have pycharm running
>>> and loading pyspark and rest of libraries perfectly fine.
>>> My issue is that I am not sure what is triggering
>>>
>>> Error from python worker:
>>>   /cube/PY/Python27/bin/python: No module named pyspark
>>> pyspark
>>> PYTHONPATH was:
>>>
>>> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.
>>> 4.1-hadoop2.6.0.jar
>>>
>>> Question is why is yarn not getting python package to run on the single
>>> node via YARN?
>>> Some people are saying run with JAVA 6 due to zip library changes
>>> between 6/7/8, some identified bug w RH, i am on debian,  then some
>>> documentation errors but nothing is really clear.
>>>
>>> i have binaries for spark hadoop and i did just fine with spark sql
>>> module, hive, python, pandas ad yarn.
>>> Locally as i said app is working fine (pandas to spark df to parquet)
>>> But as soon as I move to yarn client mode yarn is not getting packages
>>> required to run app.
>>>
>>> If someone confirms that I need to build everything from source with
>>> specific version of software I will do that, but at this point I am not
>>> sure what to do to remedy this situation...
>>>
>>> --sasha
>>>
>>>
>>> On Sun, Sep 6, 2015 at 8:27 PM, Ashish Dutt 
>>> wrote:
>>>
>>>> Hi Aleksandar,
>>>> Quite some time ago, I faced the same problem and I found a solution
>>>> which I have posted here on my blog
>>>> <https://edumine.wordpress.com/category/apache-spark/>.
>>>> See if that can help you and if it does not then you can check out
>>>> these questions & solution on stackoverflow
>>>> <http://stackoverflow.com/search?q=no+module+named+pyspark> website
>>>>
>>>>
>>>> Sincerely,
>>>> Ashish Dutt
>>>>
>>>>
>>>> On Mon, Sep 7, 2015 at 7:17 AM, Sasha Kacanski 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I am successfully running python app via pyCharm in local mode
>>>>> setMaster(&qu

Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-07 Thread Ashish Dutt
Hello Sasha,

I have no answer for debian. My cluster is on Linux and I'm using CDH 5.4
Your question-  "Error from python worker:
  /cube/PY/Python27/bin/python: No module named pyspark"

On a single node (ie one server/machine/computer) I installed pyspark
binaries and it worked. Connected it to pycharm and it worked too.

Next I tried executing the pyspark command on another node (say the worker) in
the cluster and I got this error message: "Error from python worker: PATH:
No module named pyspark".

My first guess was that the worker was not picking up the path of the pyspark
binaries installed on the server (I tried many things, like hard-coding
the pyspark path in the config.sh file on the worker- NO LUCK; tried a
dynamic path from the code in pycharm- NO LUCK... ; searched the web and
asked the question in almost every online forum--NO LUCK..; banged my head
several times with pyspark/hadoop books--NO LUCK... Finally, one fine day a
'watermelon' dropped while brooding on this problem and I installed pyspark
binaries on all the worker machines). Now when I tried executing just the
command pyspark on the workers, it worked. Tried some simple program
snippets on each worker, and they worked too.

I am not sure if this will help or not for your use-case.



Sincerely,
Ashish

On Mon, Sep 7, 2015 at 11:04 PM, Sasha Kacanski  wrote:

> Thanks Ashish,
> nice blog but does not cover my issue. Actually I have pycharm running and
> loading pyspark and rest of libraries perfectly fine.
> My issue is that I am not sure what is triggering
>
> Error from python worker:
>   /cube/PY/Python27/bin/python: No module named pyspark
> pyspark
> PYTHONPATH was:
>
> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.
> 4.1-hadoop2.6.0.jar
>
> Question is why is yarn not getting python package to run on the single
> node via YARN?
> Some people are saying run with JAVA 6 due to zip library changes between
> 6/7/8, some identified bug w RH, i am on debian,  then some documentation
> errors but nothing is really clear.
>
> i have binaries for spark hadoop and i did just fine with spark sql
> module, hive, python, pandas ad yarn.
> Locally as i said app is working fine (pandas to spark df to parquet)
> But as soon as I move to yarn client mode yarn is not getting packages
> required to run app.
>
> If someone confirms that I need to build everything from source with
> specific version of software I will do that, but at this point I am not
> sure what to do to remedy this situation...
>
> --sasha
>
>
> On Sun, Sep 6, 2015 at 8:27 PM, Ashish Dutt 
> wrote:
>
>> Hi Aleksandar,
>> Quite some time ago, I faced the same problem and I found a solution
>> which I have posted here on my blog
>> <https://edumine.wordpress.com/category/apache-spark/>.
>> See if that can help you and if it does not then you can check out these
>> questions & solution on stackoverflow
>> <http://stackoverflow.com/search?q=no+module+named+pyspark> website
>>
>>
>> Sincerely,
>> Ashish Dutt
>>
>>
>> On Mon, Sep 7, 2015 at 7:17 AM, Sasha Kacanski 
>> wrote:
>>
>>> Hi,
>>> I am successfully running python app via pyCharm in local mode
>>> setMaster("local[*]")
>>>
>>> When I turn on SparkConf().setMaster("yarn-client")
>>>
>>> and run via
>>>
>>> park-submit PysparkPandas.py
>>>
>>>
>>> I run into issue:
>>> Error from python worker:
>>>   /cube/PY/Python27/bin/python: No module named pyspark
>>> PYTHONPATH was:
>>>
>>> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.4.1-hadoop2.6.0.jar
>>>
>>> I am running java
>>> hadoop@pluto:~/pySpark$ /opt/java/jdk/bin/java -version
>>> java version "1.8.0_31"
>>> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
>>>
>>> Should I try same thing with java 6/7
>>>
>>> Is this packaging issue or I have something wrong with configurations ...
>>>
>>> Regards,
>>>
>>> --
>>> Aleksandar Kacanski
>>>
>>
>>
>
>
> --
> Aleksandar Kacanski
>


Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-06 Thread Ashish Dutt
Hi Aleksandar,
Quite some time ago, I faced the same problem and I found a solution which
I have posted here on my blog
<https://edumine.wordpress.com/category/apache-spark/>.
See if that can help you and if it does not then you can check out these
questions & solution on stackoverflow
<http://stackoverflow.com/search?q=no+module+named+pyspark> website


Sincerely,
Ashish Dutt


On Mon, Sep 7, 2015 at 7:17 AM, Sasha Kacanski  wrote:

> Hi,
> I am successfully running python app via pyCharm in local mode
> setMaster("local[*]")
>
> When I turn on SparkConf().setMaster("yarn-client")
>
> and run via
>
> park-submit PysparkPandas.py
>
>
> I run into issue:
> Error from python worker:
>   /cube/PY/Python27/bin/python: No module named pyspark
> PYTHONPATH was:
>
> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.4.1-hadoop2.6.0.jar
>
> I am running java
> hadoop@pluto:~/pySpark$ /opt/java/jdk/bin/java -version
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
>
> Should I try same thing with java 6/7
>
> Is this packaging issue or I have something wrong with configurations ...
>
> Regards,
>
> --
> Aleksandar Kacanski
>


Re: FlatMap Explanation

2015-09-03 Thread Ashish Soni
Thanks a lot everyone.
Very Helpful.

Ashish
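
For anyone reading this later, a rough Java analogue of the example explained in the reply quoted below (hedged: it builds the x..3 range by hand instead of Scala's x.to(3), and assumes a Spark 1.x JavaSparkContext named sc, where flatMap returns an Iterable):

// needs java.util.Arrays, java.util.ArrayList, java.util.List
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 3));
JavaRDD<Integer> out = rdd.flatMap(x -> {
    List<Integer> range = new ArrayList<>();
    for (int i = x; i <= 3; i++) {
        range.add(i);              // x.to(3) == x, x+1, ..., 3
    }
    return range;                  // each element expands into its own small list
});
// out.collect() -> [1, 2, 3, 2, 3, 3, 3]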

On Thu, Sep 3, 2015 at 2:19 AM, Zalzberg, Idan (Agoda) <
idan.zalzb...@agoda.com> wrote:

> Hi,
>
> Yes, I can explain
>
>
>
> 1 to 3 -> 1,2,3
>
> 2 to 3- > 2,3
>
> 3 to 3 -> 3
>
> 3 to 3 -> 3
>
>
>
> Flat map that concatenates the results, so you get
>
>
>
> 1,2,3, 2,3, 3,3
>
>
>
> You should get the same with any scala collection
>
>
>
> Cheers
>
>
>
> *From:* Ashish Soni [mailto:asoni.le...@gmail.com]
> *Sent:* Thursday, September 03, 2015 9:06 AM
> *To:* user 
> *Subject:* FlatMap Explanation
>
>
>
> Hi ,
>
> Can some one please explain the output of the flat map
>
> data in RDD as below
>
> {1, 2, 3, 3}
>
> rdd.flatMap(x => x.to(3))
>
> output as below
>
> {1, 2, 3, 2, 3, 3, 3}
>
> i am not able to understand how the output came as above.
>
> Thanks,
>
> --
> This message is confidential and is for the sole use of the intended
> recipient(s). It may also be privileged or otherwise protected by copyright
> or other legal rules. If you have received it by mistake please let us know
> by reply email and delete it from your system. It is prohibited to copy
> this message or disclose its content to anyone. Any confidentiality or
> privilege is not waived or lost by any mistaken delivery or unauthorized
> disclosure of the message. All messages sent to and from Agoda may be
> monitored to ensure compliance with company policies, to protect the
> company's interests and to remove potential malware. Electronic messages
> may be intercepted, amended, lost or deleted, or contain viruses.
>


FlatMap Explanation

2015-09-02 Thread Ashish Soni
Hi ,

Can someone please explain the output of flatMap?
The data in the RDD is as below:
{1, 2, 3, 3}

rdd.flatMap(x => x.to(3))

The output is as below:

{1, 2, 3, 2, 3, 3, 3}

I am not able to understand how the output came to be as above.

Thanks,


Re: Spark shell and StackOverFlowError

2015-08-31 Thread Ashish Shrowty
Yes .. I am closing the stream.

Not sure what you meant by "bq. and then create rdd"?

-Ashish

On Mon, Aug 31, 2015 at 1:02 PM Ted Yu  wrote:

> I am not familiar with your code.
>
> bq. and then create the rdd
>
> I assume you call ObjectOutputStream.close() prior to the above step.
>
> Cheers
>
> On Mon, Aug 31, 2015 at 9:42 AM, Ashish Shrowty 
> wrote:
>
>> Sure .. here it is (scroll below to see the NotSerializableException).
>> Note that upstream, I do load up the (user,item,ratings) data from a file
>> using ObjectInputStream, do some calculations that I put in a map and then
>> create the rdd used in the code above from that map. I even tried
>> checkpointing the rdd and persisting it to break any lineage to the
>> original ObjectInputStream (if that was what was happening) -
>>
>> org.apache.spark.SparkException: Task not serializable
>>
>> at
>> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
>>
>> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>>
>> at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
>>
>> at org.apache.spark.rdd.RDD.flatMap(RDD.scala:295)
>>
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>>
>> at $iwC$$iwC$$iwC$$iwC$$iwC.(:46)
>>
>> at $iwC$$iwC$$iwC$$iwC.(:48)
>>
>> at $iwC$$iwC$$iwC.(:50)
>>
>> at $iwC$$iwC.(:52)
>>
>> at $iwC.(:54)
>>
>> at (:56)
>>
>> at .(:60)
>>
>> at .()
>>
>> at .(:7)
>>
>> at .()
>>
>> at $print()
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>
>> at
>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
>>
>> at
>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
>>
>> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
>>
>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
>>
>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
>>
>> at org.apache.spark.repl.SparkILoop.pasteCommand(SparkILoop.scala:796)
>>
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$standardCommands$8.apply(SparkILoop.scala:321)
>>
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$standardCommands$8.apply(SparkILoop.scala:321)
>>
>> at
>> scala.tools.nsc.interpreter.LoopCommands$LoopCommand$$anonfun$nullary$1.apply(LoopCommands.scala:65)
>>
>> at
>> scala.tools.nsc.interpreter.LoopCommands$LoopCommand$$anonfun$nullary$1.apply(LoopCommands.scala:65)
>>
>> at
>> scala.tools.nsc.interpreter.LoopCommands$NullaryCmd.apply(LoopCommands.scala:76)
>>
>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780)
>>
>> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
>>
>> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
>>
>> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
>>
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
>>
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
>>
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
>>
>> at
>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>
>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
>>
>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
>>
>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>>
>> at org.apache.spark.repl.Main.main(Main.scala)
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>
>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>>
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>
>> at org.apache.spark.deploy.SparkSubmit.main(Spar

Re: Spark shell and StackOverFlowError

2015-08-31 Thread Ashish Shrowty
Sure .. here it is (scroll below to see the NotSerializableException). Note
that upstream, I do load up the (user,item,ratings) data from a file using
ObjectInputStream, do some calculations that I put in a map and then create
the rdd used in the code above from that map. I even tried checkpointing
the rdd and persisting it to break any lineage to the original
ObjectInputStream (if that was what was happening) -

org.apache.spark.SparkException: Task not serializable

at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)

at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)

at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)

at org.apache.spark.rdd.RDD.flatMap(RDD.scala:295)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)

at $iwC$$iwC$$iwC$$iwC$$iwC.(:46)

at $iwC$$iwC$$iwC$$iwC.(:48)

at $iwC$$iwC$$iwC.(:50)

at $iwC$$iwC.(:52)

at $iwC.(:54)

at (:56)

at .(:60)

at .()

at .(:7)

at .()

at $print()

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)

at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)

at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)

at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)

at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)

at org.apache.spark.repl.SparkILoop.pasteCommand(SparkILoop.scala:796)

at
org.apache.spark.repl.SparkILoop$$anonfun$standardCommands$8.apply(SparkILoop.scala:321)

at
org.apache.spark.repl.SparkILoop$$anonfun$standardCommands$8.apply(SparkILoop.scala:321)

at
scala.tools.nsc.interpreter.LoopCommands$LoopCommand$$anonfun$nullary$1.apply(LoopCommands.scala:65)

at
scala.tools.nsc.interpreter.LoopCommands$LoopCommand$$anonfun$nullary$1.apply(LoopCommands.scala:65)

at
scala.tools.nsc.interpreter.LoopCommands$NullaryCmd.apply(LoopCommands.scala:76)

at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780)

at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)

at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)

at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)

at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)

at org.apache.spark.repl.Main$.main(Main.scala:31)

at org.apache.spark.repl.Main.main(Main.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

*Caused by: java.io.NotSerializableException: java.io.ObjectInputStream*

at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)

at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)

at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)

at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)

at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)

at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)

at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)

...

...

On Mon, Aug 31, 2015 at 12:23 PM Ted Yu  wrote:

> Ashish:
> Can you post the complete stack trace for NotSerializableException ?
>
> Cheers
>
> On Mon, Aug 31, 2015 at 8:49 AM, Ashish Shrowty 
> wrote:
>
>> bcItemsIdx is just a broadcast variable constructed out of
>> Array[(String)] .. it holds the item ids and I use it for indexing the
>> MatrixEntry objects
>>
>>
>> On Mon, Aug 31, 2015 at 10:41 AM Sean Owen  wrote:
>>
>>> It's not clear; that error is different still and somehow suggests
>>> you're serializing a stream somewhere. I'd look at what's inside
>>> bcItemsIdx as that is not shown here.
>>>
>>> On Mon, Aug 31, 2015 at 3:34 PM, Ashish Shrowty
>>>
>>>  wro

Re: Spark shell and StackOverFlowError

2015-08-31 Thread Ashish Shrowty
bcItemsIdx is just a broadcast variable constructed out of Array[(String)]
.. it holds the item ids and I use it for indexing the MatrixEntry objects


On Mon, Aug 31, 2015 at 10:41 AM Sean Owen  wrote:

> It's not clear; that error is different still and somehow suggests
> you're serializing a stream somewhere. I'd look at what's inside
> bcItemsIdx as that is not shown here.
>
> On Mon, Aug 31, 2015 at 3:34 PM, Ashish Shrowty
>  wrote:
> > Sean,
> >
> > Thanks for your comments. What I was really trying to do was to
> transform a
> > RDD[(userid,itemid,ratings)] into a RowMatrix so that I can do some
> column
> > similarity calculations while exploring the data before building some
> > models. But to do that I need to first convert the user and item ids into
> > respective indexes where I intended on passing in an array into the
> closure,
> > which is where I got stuck with this overflowerror trying to figure out
> > where it is happening. The actual error I got was slightly different
> (Caused
> > by: java.io.NotSerializableException: java.io.ObjectInputStream). I
> started
> > investigating this issue which led me to the earlier code snippet that I
> had
> > posted. This is again because of the bcItemsIdx variable being passed
> into
> > the closure. Below code works if I don't pass in the variable and use
> simply
> > a constant like 10 in its place .. The code thus far -
> >
> > // rdd below is RDD[(String,String,Double)]
> > // bcItemsIdx below is Broadcast[Array[String]] which is an array of item
> > ids
> > val gRdd = rdd.map{case(user,item,rating) =>
> > ((user),(item,rating))}.groupByKey
> > val idxRdd = gRdd.zipWithIndex
> > val cm = new CoordinateMatrix(
> > idxRdd.flatMap[MatrixEntry](e => {
> > e._1._2.map(item=> {
> >  MatrixEntry(e._2, bcItemsIdx.value.indexOf(item._1),
> > item._2) // <- This is where I get the Serialization error passing in the
> > index
> >  // MatrixEntry(e._2, 10, item._2) // <- This works
> > })
> > })
> > )
> > val rm = cm.toRowMatrix
> > val simMatrix = rm.columnSimilarities()
> >
> > I would like to make this work in the Spark shell as I am still exploring
> > the data. Let me know if there is an alternate way of constructing the
> > RowMatrix.
> >
> > Thanks and appreciate all the help!
> >
> > Ashish
> >
> > On Mon, Aug 31, 2015 at 3:41 AM Sean Owen  wrote:
> >>
> >> Yeah I see that now. I think it fails immediately because the map
> >> operation does try to clean and/or verify the serialization of the
> >> closure upfront.
> >>
> >> I'm not quite sure what is going on, but I think it's some strange
> >> interaction between how you're building up the list and what the
> >> resulting representation happens to be like, and how the closure
> >> cleaner works, which can't be perfect. The shell also introduces an
> >> extra layer of issues.
> >>
> >> For example, the slightly more canonical approaches work fine:
> >>
> >> import scala.collection.mutable.MutableList
> >> val lst = MutableList[(String,String,Double)]()
> >> (0 to 1).foreach(i => lst :+ ("10", "10", i.toDouble))
> >>
> >> or just
> >>
> >> val lst = (0 to 1).map(i => ("10", "10", i.toDouble))
> >>
> >> If you just need this to work, maybe those are better alternatives
> anyway.
> >> You can also check whether it works without the shell, as I suspect
> >> that's a factor.
> >>
> >> It's not an error in Spark per se but saying that something's default
> >> Java serialization graph is very deep, so it's like the code you wrote
> >> plus the closure cleaner ends up pulling in some huge linked list and
> >> serializing it the direct and unuseful way.
> >>
> >> If you have an idea about exactly why it's happening you can open a
> >> JIRA, but arguably it's something that's nice to just work but isn't
> >> to do with Spark per se. Or, have a look at others related to the
> >> closure and shell and you may find this is related to other known
> >> behavior.
> >>
> >>
> >> On Sun, Aug 30, 2015 at 8:08 PM, Ashish Shrowty
> >>  wrote:
> >> > Sean .. does the code below work for you in the Spark shell? Ted got
> the
> >> > same error -

Re: Spark shell and StackOverFlowError

2015-08-30 Thread Ashish Shrowty
Do you think I should create a JIRA?


On Sun, Aug 30, 2015 at 12:56 PM Ted Yu  wrote:

> I got StackOverFlowError as well :-(
>
> On Sun, Aug 30, 2015 at 9:47 AM, Ashish Shrowty 
> wrote:
>
>> Yep .. I tried that too earlier. Doesn't make a difference. Are you able
>> to replicate on your side?
>>
>>
>> On Sun, Aug 30, 2015 at 12:08 PM Ted Yu  wrote:
>>
>>> I see.
>>>
>>> What about using the following in place of variable a ?
>>>
>>> http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
>>>
>>> Cheers
>>>
>>> On Sun, Aug 30, 2015 at 8:54 AM, Ashish Shrowty <
>>> ashish.shro...@gmail.com> wrote:
>>>
>>>> @Sean - Agree that there is no action, but I still get the
>>>> stackoverflowerror, its very weird
>>>>
>>>> @Ted - Variable a is just an int - val a = 10 ... The error happens
>>>> when I try to pass a variable into the closure. The example you have above
>>>> works fine since there is no variable being passed into the closure from
>>>> the shell.
>>>>
>>>> -Ashish
>>>>
>>>> On Sun, Aug 30, 2015 at 9:55 AM Ted Yu  wrote:
>>>>
>>>>> Using Spark shell :
>>>>>
>>>>> scala> import scala.collection.mutable.MutableList
>>>>> import scala.collection.mutable.MutableList
>>>>>
>>>>> scala> val lst = MutableList[(String,String,Double)]()
>>>>> lst: scala.collection.mutable.MutableList[(String, String, Double)] =
>>>>> MutableList()
>>>>>
>>>>> scala> Range(0,1).foreach(i=>lst+=(("10","10",i:Double)))
>>>>>
>>>>> scala> val rdd=sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>>>>> :27: error: not found: value a
>>>>>val rdd=sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>>>>>   ^
>>>>>
>>>>> scala> val rdd=sc.makeRDD(lst).map(i=> if(i._1==10) 1 else 0)
>>>>> rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at
>>>>> :27
>>>>>
>>>>> scala> rdd.count()
>>>>> ...
>>>>> 15/08/30 06:53:40 INFO DAGScheduler: Job 0 finished: count at
>>>>> :30, took 0.478350 s
>>>>> res1: Long = 1
>>>>>
>>>>> Ashish:
>>>>> Please refine your example to mimic more closely what your code
>>>>> actually did.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Sun, Aug 30, 2015 at 12:24 AM, Sean Owen 
>>>>> wrote:
>>>>>
>>>>>> That can't cause any error, since there is no action in your first
>>>>>> snippet. Even calling count on the result doesn't cause an error. You
>>>>>> must be executing something different.
>>>>>>
>>>>>> On Sun, Aug 30, 2015 at 4:21 AM, ashrowty 
>>>>>> wrote:
>>>>>> > I am running the Spark shell (1.2.1) in local mode and I have a
>>>>>> simple
>>>>>> > RDD[(String,String,Double)] with about 10,000 objects in it. I get a
>>>>>> > StackOverFlowError each time I try to run the following code (the
>>>>>> code
>>>>>> > itself is just representative of other logic where I need to pass
>>>>>> in a
>>>>>> > variable). I tried broadcasting the variable too, but no luck ..
>>>>>> missing
>>>>>> > something basic here -
>>>>>> >
>>>>>> > val rdd = sc.makeRDD(List()
>>>>>> > val a=10
>>>>>> > rdd.map(r => if (a==10) 1 else 0)
>>>>>> > This throws -
>>>>>> >
>>>>>> > java.lang.StackOverflowError
>>>>>> > at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:318)
>>>>>> > at
>>>>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1133)
>>>>>> > at
>>>>>> >
>>>>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>>>>> > at
>>>>>> >
>>>>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java

Re: Spark shell and StackOverFlowError

2015-08-30 Thread Ashish Shrowty
Yep .. I tried that too earlier. Doesn't make a difference. Are you able to
replicate on your side?


On Sun, Aug 30, 2015 at 12:08 PM Ted Yu  wrote:

> I see.
>
> What about using the following in place of variable a ?
>
> http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
>
> Cheers
>
> On Sun, Aug 30, 2015 at 8:54 AM, Ashish Shrowty 
> wrote:
>
>> @Sean - Agree that there is no action, but I still get the
>> stackoverflowerror, its very weird
>>
>> @Ted - Variable a is just an int - val a = 10 ... The error happens when
>> I try to pass a variable into the closure. The example you have above works
>> fine since there is no variable being passed into the closure from the
>> shell.
>>
>> -Ashish
>>
>> On Sun, Aug 30, 2015 at 9:55 AM Ted Yu  wrote:
>>
>>> Using Spark shell :
>>>
>>> scala> import scala.collection.mutable.MutableList
>>> import scala.collection.mutable.MutableList
>>>
>>> scala> val lst = MutableList[(String,String,Double)]()
>>> lst: scala.collection.mutable.MutableList[(String, String, Double)] =
>>> MutableList()
>>>
>>> scala> Range(0,1).foreach(i=>lst+=(("10","10",i:Double)))
>>>
>>> scala> val rdd=sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>>> :27: error: not found: value a
>>>val rdd=sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>>>   ^
>>>
>>> scala> val rdd=sc.makeRDD(lst).map(i=> if(i._1==10) 1 else 0)
>>> rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at
>>> :27
>>>
>>> scala> rdd.count()
>>> ...
>>> 15/08/30 06:53:40 INFO DAGScheduler: Job 0 finished: count at
>>> :30, took 0.478350 s
>>> res1: Long = 1
>>>
>>> Ashish:
>>> Please refine your example to mimic more closely what your code actually
>>> did.
>>>
>>> Thanks
>>>
>>> On Sun, Aug 30, 2015 at 12:24 AM, Sean Owen  wrote:
>>>
>>>> That can't cause any error, since there is no action in your first
>>>> snippet. Even calling count on the result doesn't cause an error. You
>>>> must be executing something different.
>>>>
>>>> On Sun, Aug 30, 2015 at 4:21 AM, ashrowty 
>>>> wrote:
>>>> > I am running the Spark shell (1.2.1) in local mode and I have a simple
>>>> > RDD[(String,String,Double)] with about 10,000 objects in it. I get a
>>>> > StackOverFlowError each time I try to run the following code (the code
>>>> > itself is just representative of other logic where I need to pass in a
>>>> > variable). I tried broadcasting the variable too, but no luck ..
>>>> missing
>>>> > something basic here -
>>>> >
>>>> > val rdd = sc.makeRDD(List()
>>>> > val a=10
>>>> > rdd.map(r => if (a==10) 1 else 0)
>>>> > This throws -
>>>> >
>>>> > java.lang.StackOverflowError
>>>> > at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:318)
>>>> > at
>>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1133)
>>>> > at
>>>> >
>>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>>> > at
>>>> >
>>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>>> > at
>>>> >
>>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>>> > at
>>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>>> > at
>>>> >
>>>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>>>> > at
>>>> >
>>>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>>>> > at
>>>> >
>>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>>>> > ...
>>>> > ...
>>>> >
>>>> > More experiments  .. this works -
>>>> >
>>>> > val lst = Range(0,1).map(i=>("10","10",i:Double)).toList
>>>> > sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>>>> >
>>>> > But below doesn't and throws the StackoverflowError -
>>>> >
>>>> > val lst = MutableList[(String,String,Double)]()
>>>> > Range(0,1).foreach(i=>lst+=(("10","10",i:Double)))
>>>> > sc.makeRDD(lst).map(i=> if(a==10) 1 else 0)
>>>> >
>>>> > Any help appreciated!
>>>> >
>>>> > Thanks,
>>>> > Ashish
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-shell-and-StackOverFlowError-tp24508.html
>>>> > Sent from the Apache Spark User List mailing list archive at
>>>> Nabble.com.
>>>> >
>>>> > -
>>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>>> >
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>


Re: Driver running out of memory - caused by many tasks?

2015-08-27 Thread Ashish Rangole
I suggest taking a heap dump of the driver process using jmap. Then open that
dump in a tool like VisualVM to see which object(s) are taking up heap
space. It is easy to do. We did this and found out that in our case it was
the data structure that stores info about stages, jobs and tasks. There can
be other reasons as well, of course.
On Aug 27, 2015 4:17 AM,  wrote:

> I should have mentioned: yes I am using Kryo and have registered KeyClass
> and ValueClass.
>
>
>
> I guess it’s not clear to me what is actually taking up space on the
> driver heap - I can’t see how it can be data with the code that I have.
>
> On 27/08/2015 12:09, "Ewan Leith"  wrote:
>
> >Are you using the Kryo serializer? If not, have a look at it, it can save
> a lot of memory during shuffles
> >
> >https://spark.apache.org/docs/latest/tuning.html
> >
> >I did a similar task and had various issues with the volume of data being
> parsed in one go, but that helped a lot. It looks like the main difference
> from what you're doing to me is that my input classes were just a string
> and a byte array, which I then processed once it was read into the RDD,
> maybe your classes are memory heavy?
> >
> >
> >Thanks,
> >Ewan
> >
> >-Original Message-
> >From: andrew.row...@thomsonreuters.com [mailto:
> andrew.row...@thomsonreuters.com]
> >Sent: 27 August 2015 11:53
> >To: user@spark.apache.org
> >Subject: Driver running out of memory - caused by many tasks?
> >
> >I have a spark v.1.4.1 on YARN job where the first stage has ~149,000
> tasks (it’s reading a few TB of data). The job itself is fairly simple -
> it’s just getting a list of distinct values:
> >
> >val days = spark
> >  .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
> >  .sample(withReplacement = false, fraction = 0.01)
> >  .map(row => row._1.getTimestamp.toString("-MM-dd"))
> >  .distinct()
> >  .collect()
> >
> >The cardinality of the ‘day’ is quite small - there’s only a handful.
> However, I’m frequently running into OutOfMemory issues on the driver. I’ve
> had it fail with 24GB RAM, and am currently nudging it upwards to find out
> where it works. The ratio between input and shuffle write in the distinct
> stage is about 3TB:7MB. On a smaller dataset, it works without issue on a
> smaller (4GB) heap. In YARN cluster mode, I get a failure message similar
> to:
> >
> >Container
> [pid=36844,containerID=container_e15_1438040390147_4982_01_01] is
> running beyond physical memory limits. Current usage: 27.6 GB of 27 GB
> physical memory used; 29.5 GB of 56.7 GB virtual memory used. Killing
> container.
> >
> >
> >Is the driver running out of memory simply due to the number of tasks, or
> is there something about the job program that’s causing it to put a lot of
> data into the driver heap and go oom? If the former, is there any general
> guidance about the amount of memory to give to the driver as a function of
> how many tasks there are?
> >
> >Andrew


Re: Worker Machine running out of disk for Long running Streaming process

2015-08-22 Thread Ashish Rangole
Interesting. TD, can you please throw some light on why this is and point
to the relevant code in the Spark repo? It will help in a better understanding
of things that can affect a long-running streaming job.
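
For reference, a minimal sketch of the periodic-GC workaround suggested in the reply quoted below (hedged: it simply schedules System.gc() inside the driver program every 10 minutes; a GC lets Spark's ContextCleaner notice shuffles whose RDDs are no longer referenced and delete their files):

// needs java.util.concurrent.Executors, ScheduledExecutorService, TimeUnit
ScheduledExecutorService gcScheduler = Executors.newSingleThreadScheduledExecutor();
gcScheduler.scheduleAtFixedRate(System::gc, 10, 10, TimeUnit.MINUTES);  // driver JVM only
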
On Aug 21, 2015 1:44 PM, "Tathagata Das"  wrote:

> Could you periodically (say every 10 mins) run System.gc() on the driver.
> The cleaning up shuffles is tied to the garbage collection.
>
>
> On Fri, Aug 21, 2015 at 2:59 AM, gaurav sharma 
> wrote:
>
>> Hi All,
>>
>>
>> I have a 24x7 running Streaming Process, which runs on 2 hour windowed
>> data
>>
>> The issue i am facing is my worker machines are running OUT OF DISK space
>>
>> I checked that the SHUFFLE FILES are not getting cleaned up.
>>
>>
>> /log/spark-2b875d98-1101-4e61-86b4-67c9e71954cc/executor-5bbb53c1-cee9-4438-87a2-b0f2becfac6f/blockmgr-c905b93b-c817-4124-a774-be1e706768c1//00/shuffle_2739_5_0.data
>>
>> Ultimately the machines runs out of Disk Spac
>>
>>
>> i read about *spark.cleaner.ttl *config param which what i can
>> understand from the documentation, says cleans up all the metadata beyond
>> the time limit.
>>
>> I went through https://issues.apache.org/jira/browse/SPARK-5836
>> it says resolved, but there is no code commit
>>
>> Can anyone please throw some light on the issue.
>>
>>
>>
>


Java Streaming Context - File Stream use

2015-08-10 Thread Ashish Soni
Please help; I am not sure what is incorrect with the code below, as it gives me a
compilation error in Eclipse.

 SparkConf sparkConf = new
SparkConf().setMaster("local[4]").setAppName("JavaDirectKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.seconds(2));

*jssc.fileStream("/home/", String.class, String.class,
TextInputFormat.class);*
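
The generic bound on fileStream is the likely cause of the compile error here: the key and value Class arguments have to match what the InputFormat produces, and the new-API org.apache.hadoop.mapreduce.lib.input.TextInputFormat produces LongWritable/Text pairs, not String/String. A hedged sketch of two variants that should type-check (assuming Spark 1.x streaming and the Hadoop classes named in the comments):

// import org.apache.hadoop.io.LongWritable;
// import org.apache.hadoop.io.Text;
// import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
// import org.apache.spark.streaming.api.java.JavaDStream;
// import org.apache.spark.streaming.api.java.JavaPairInputDStream;

JavaPairInputDStream<LongWritable, Text> fileLines =
    jssc.fileStream("/home/", LongWritable.class, Text.class, TextInputFormat.class);

// For plain text files, the dedicated helper avoids the generics entirely:
JavaDStream<String> textLines = jssc.textFileStream("/home/");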


How to connect to remote HDFS programmatically to retrieve data, analyse it and then write the data back to HDFS?

2015-08-05 Thread Ashish Dutt
7077
15/08/01 14:09:06 WARN AppClient$ClientActor: Could not connect to
akka.tcp://sparkMaster@10.210.250.400:7077:
akka.remote.InvalidAssociation: Invalid address:
akka.tcp://sparkMaster@10.210.250.400:7077
15/08/01 14:09:06 WARN Remoting: Tried to associate with unreachable
remote address [akka.tcp://sparkMaster@10.210.250.400:7077]. Address
is now gated for 5000 ms, all messages to this address will be
delivered to dead letters. Reason: Connection refused: no further
information: /10.210.250.400:7077
15/08/01 14:09:25 ERROR SparkDeploySchedulerBackend: Application has
been killed. Reason: All masters are unresponsive! Giving up.
15/08/01 14:09:25 WARN SparkDeploySchedulerBackend: Application ID is
not initialized yet.
15/08/01 14:09:25 ERROR OneForOneStrategy:
java.lang.NullPointerException
at 
org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/08/01 14:09:25 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
at org.apache.spark.SparkContext.(SparkContext.scala:543)
at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
File "C:/Users/ashish
dutt/PycharmProjects/KafkaToHDFS/local2Remote.py", line 26, in

sc = SparkContext(conf=conf)
File "C:\spark-1.4.0\python\pyspark\context.py", line 113, in __init__
conf, jsc, profiler_cls)
File "C:\spark-1.4.0\python\pyspark\context.py", line 165, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "C:\spark-1.4.0\python\pyspark\context.py", line 219, in
_initialize_context
return self._jvm.JavaSparkContext(jconf)
File "C:\spark-1.4.0\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py",
line 701, in __call__
File "C:\spark-1.4.0\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
at org.apache.spark.SparkContext.(SparkContext.scala:543)
at org

PySpark in Pycharm- unable to connect to remote server

2015-08-05 Thread Ashish Dutt
iate with unreachable
remote address [akka.tcp://sparkMaster@10.210.250.400:7077]. Address
is now gated for 5000 ms, all messages to this address will be
delivered to dead letters. Reason: Connection refused: no further
information: /10.210.250.400:7077
15/08/01 14:09:25 ERROR SparkDeploySchedulerBackend: Application has
been killed. Reason: All masters are unresponsive! Giving up.
15/08/01 14:09:25 WARN SparkDeploySchedulerBackend: Application ID is
not initialized yet.
15/08/01 14:09:25 ERROR OneForOneStrategy:
java.lang.NullPointerException
at 
org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/08/01 14:09:25 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
at 
org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
at org.apache.spark.SparkContext.(SparkContext.scala:543)
at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
File "C:/Users/ashish
dutt/PycharmProjects/KafkaToHDFS/local2Remote.py", line 26, in

sc = SparkContext(conf=conf)
File "C:\spark-1.4.0\python\pyspark\context.py", line 113, in __init__
conf, jsc, profiler_cls)
File "C:\spark-1.4.0\python\pyspark\context.py", line 165, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "C:\spark-1.4.0\python\pyspark\context.py", line 219, in
_initialize_context
return self._jvm.JavaSparkContext(jconf)
File "C:\spark-1.4.0\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py",
line 701, in __call__
File "C:\spark-1.4.0\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
at 
org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
at org.apache.spark.SparkContext.(SparkContext.scala:543)
at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.

spark.files.userClassPathFirst=true Return Error - Please help

2015-07-22 Thread Ashish Soni
Hi All ,

I am getting the below error when I use the --conf
spark.files.userClassPathFirst=true parameter:

Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most
recent failure: Lost task 3.3 in stage 0.0 (TID 32, 10.200.37.161):
java.lang.ClassCastException: cannot assign instance of scala.None$ to
field org.apache.spark.scheduler.Task.metrics of type scala.Option in
instance of org.apache.spark.scheduler.ResultTask

I am invoking it as below:

spark-submit --conf spark.files.userClassPathFirst=true --driver-memory 6g
--executor-memory 12g --executor-cores 4   --class
com.ericsson.engine.RateDriver --master local
/home/spark/workspace/simplerating/target/simplerating-0.0.1-SNAPSHOT.jar
spark://eSPARKMASTER:7077 hdfs://enamenode/user/spark

thanks


Class Loading Issue - Spark Assembly and Application Provided

2015-07-21 Thread Ashish Soni
Hi All ,

I am having a class-loading issue: the Spark assembly uses Google Guice
internally, and one of the jars I am using depends on sisu-guice-3.1.0-no_aop.jar. How
do I load my classes first, so that this doesn't result in an error, and tell Spark
to load its assembly afterwards?

Ashish


XML Parsing

2015-07-19 Thread Ashish Soni
Hi All,

I have an XML file with the same tag repeated multiple times, as below. Please
suggest what would be the best way to process this data inside Spark.

How can I extract each opening and closing tag and process the record between
them, or how can I combine multiple lines into a single line?





[XML sample not preserved: the repeated tags were stripped by the mailing-list archive]

Thanks,
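
One common approach, sketched below under assumptions (the element name <record> and
the input path are placeholders, not values from this thread), is to set a custom
record delimiter for Hadoop's TextInputFormat so that each multi-line element arrives
as one record:

from pyspark import SparkContext

sc = SparkContext("local[2]", "xml-records")

# Everything up to each closing tag becomes one record, so a multi-line
# element ends up as a single string in the RDD.
delim_conf = {"textinputformat.record.delimiter": "</record>"}
raw = sc.newAPIHadoopFile(
    "hdfs:///path/to/file.xml",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=delim_conf)

records = (raw.map(lambda kv: kv[1].strip())          # drop the byte-offset key
              .filter(lambda s: "<record>" in s)      # skip header/trailer fragments
              .map(lambda s: s[s.index("<record>"):] + "</record>"))
print(records.take(2))
sc.stop()

For small files, sc.wholeTextFiles followed by a split on the tag also works.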


BroadCast on Interval ( eg every 10 min )

2015-07-16 Thread Ashish Soni
Hi All,
How can I broadcast a data change to all the executors every 10 minutes (or
every 1 minute)?

Ashish
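
Broadcast variables are immutable, so the usual pattern is to rebuild the broadcast
on the driver once it is older than some TTL and unpersist the stale copy. A minimal
PySpark Streaming sketch of that pattern follows; the loader, socket source, port,
and 10-minute TTL are assumptions, not details from this thread:

import time

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def load_reference_data():
    # Placeholder loader: replace with a read from your database/file.
    return {"a": 1, "b": 2}

class RefreshableBroadcast(object):
    """Rebuilds the broadcast on the driver once it is older than ttl seconds."""
    def __init__(self, sc, loader, ttl):
        self.sc, self.loader, self.ttl = sc, loader, ttl
        self.bc, self.loaded_at = sc.broadcast(loader()), time.time()

    def get(self):
        if time.time() - self.loaded_at > self.ttl:
            self.bc.unpersist()                       # drop the stale copy on the executors
            self.bc = self.sc.broadcast(self.loader())
            self.loaded_at = time.time()
        return self.bc

sc = SparkContext("local[2]", "refreshable-broadcast")
ssc = StreamingContext(sc, 10)                        # 10-second batches
ref = RefreshableBroadcast(sc, load_reference_data, ttl=600)  # refresh roughly every 10 minutes

def process(rdd):
    bc = ref.get()                                    # runs on the driver, once per batch
    print(rdd.map(lambda k: (k, bc.value.get(k))).take(5))

ssc.socketTextStream("localhost", 9999).foreachRDD(process)
ssc.start()
ssc.awaitTermination()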


Re: Is it possible to change the default port number 7077 for spark?

2015-07-13 Thread Ashish Dutt
Hello Arun,
Thank you for the descriptive response.
And thank you for providing the sample file too. It certainly is a great
help.

Sincerely,
Ashish



On Mon, Jul 13, 2015 at 10:30 PM, Arun Verma 
wrote:

>
> PFA sample file
>
> On Mon, Jul 13, 2015 at 7:37 PM, Arun Verma 
> wrote:
>
>> Hi,
>>
>> Yes it is. To do it follow these steps;
>> 1. cd spark/intallation/path/.../conf
>> 2. cp spark-env.sh.template spark-env.sh
>> 3. vi spark-env.sh
>> 4. SPARK_MASTER_PORT=9000(or any other available port)
>>
>> PFA sample file. I hope this will help.
>>
>> On Mon, Jul 13, 2015 at 7:24 PM, ashishdutt 
>> wrote:
>>
>>> Many thanks for your response.
>>> Regards,
>>> Ashish
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-change-the-default-port-number-7077-for-spark-tp23774p23797.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Thanks and Regards,
>> Arun Verma
>>
>
>
>
> --
> Thanks and Regards,
> Arun
>


Re: Data Processing speed SQL Vs SPARK

2015-07-13 Thread Ashish Mukherjee
MySQL and PostgreSQL scale to millions of records. Spark, or any
distributed/clustered computing environment, would be inefficient for the data
sizes you mention, because of the overhead of coordinating processes, moving
data around, and so on.

On Mon, Jul 13, 2015 at 5:34 PM, Sandeep Giri 
wrote:

> Even for 2L records the MySQL will be better.
>
> Regards,
> Sandeep Giri,
> +1-253-397-1945 (US)
> +91-953-899-8962 (IN)
> www.KnowBigData.com. 
>
> [image: linkedin icon]  [image:
> other site icon]   [image: facebook icon]
>  [image: twitter icon]
>  
>
>
> On Fri, Jul 10, 2015 at 9:54 AM, vinod kumar 
> wrote:
>
>> For records below 50,000 SQL is better right?
>>
>>
>> On Fri, Jul 10, 2015 at 12:18 AM, ayan guha  wrote:
>>
>>> With your load, either should be fine.
>>>
>>> I would suggest you to run couple of quick prototype.
>>>
>>> Best
>>> Ayan
>>>
>>> On Fri, Jul 10, 2015 at 2:06 PM, vinod kumar 
>>> wrote:
>>>
 Ayan,

 I would want to process a data which  nearly around 5 records to 2L
 records(in flat).

 Is there is any scaling is there to decide what technology is
 best?either SQL or SPARK?



 On Thu, Jul 9, 2015 at 9:40 AM, ayan guha  wrote:

> It depends on workload. How much data you would want to process?
> On 9 Jul 2015 22:28, "vinod kumar"  wrote:
>
>> Hi Everyone,
>>
>> I am new to spark.
>>
>> Am using SQL in my application to handle data in my application.I
>> have a thought to move to spark now.
>>
>> Is data processing speed of spark better than SQL server?
>>
>> Thank,
>> Vinod
>>
>

>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>


Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-07-13 Thread Ashish Dutt
Hello Rui Sun,

Thanks for your reply.
The file "readme.md", in the section "Using SparkR from RStudio", says to set
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

Please tell me how I can set this in a Windows environment. What I mean is, how
do I set up .libPaths()? Where is it in a Windows environment?
Thanks for your help.

Sincerely,
Ashish Dutt


On Mon, Jul 13, 2015 at 3:48 PM, Sun, Rui  wrote:

> Hi, Kachau,
>
> If you are using SparkR with RStudio, have you followed the guidelines in
> the section "Using SparkR from RStudio" in
> https://github.com/apache/spark/tree/master/R ?
>
> 
> From: kachau [umesh.ka...@gmail.com]
> Sent: Saturday, July 11, 2015 12:30 AM
> To: user@spark.apache.org
> Subject: SparkR Error in sparkR.init(master=“local”) in RStudio
>
> I have installed the SparkR package from Spark distribution into the R
> library. I can call the following command and it seems to work properly:
> library(SparkR)
>
> However, when I try to get the Spark context using the following code,
>
> sc <- sparkR.init(master="local")
> It fails after some time with the following message:
>
> Error in sparkR.init(master = "local") :
>JVM is not ready after 10 seconds
> I have set JAVA_HOME, and I have a working RStudio where I can access other
> packages like ggplot2. I don't know why it is not working, and I don't even
> know where to investigate the issue.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-Error-in-sparkR-init-master-local-in-RStudio-tp23768.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Connecting to nodes on cluster

2015-07-09 Thread Ashish Dutt
Hello Akhil,

Thanks for the response. I will have to figure this out.

Sincerely,
Ashish

On Thu, Jul 9, 2015 at 3:40 PM, Akhil Das 
wrote:

> On Wed, Jul 8, 2015 at 7:31 PM, Ashish Dutt 
> wrote:
>
>> Hi,
>>
>> We have a cluster with 4 nodes. The cluster uses CDH 5.4 for the past two
>> days I have been trying to connect my laptop to the server using spark
>>  but its been unsucessful.
>> The server contains data that needs to be cleaned and analysed.
>> The cluster and the nodes are on linux environment.
>> To connect to the nodes I am usnig SSH
>>
>> Question: Would it be better if I work directly on the nodes rather than
>> trying to connect my laptop to them ?
>>
>
> ​-> You will be able to connect to master machine in the cloud from your
> laptop​
>
> ​, but you need to make sure that the master is able to connect back to
> your laptop (may require port forwarding on your router, firewalls etc.)
>  ​
> ​
>
>> Question 2: If yes, then can you suggest any python and R IDE that I can
>> install on the nodes to make it work?
>>
>
> ​-> Once the master machine is able to connect to your laptop's public ip,
> then you can set the spark.driver.host and spark.driver.port properties and
> your job will get executed on the cluster.
> ​
>
>
>>
>> Thanks for your help
>>
>>
>> Sincerely,
>> Ashish Dutt
>>
>>
>
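
A minimal sketch of the spark.driver.host / spark.driver.port settings mentioned
above, for running the driver on a laptop against a remote standalone master; the
master URL, IP address, and port below are placeholders, not values from this thread:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("laptop-to-cluster")
        .setMaster("spark://cluster-master:7077")     # placeholder: the cluster's master URL
        .set("spark.driver.host", "203.0.113.10")     # placeholder: laptop address reachable from the cluster
        .set("spark.driver.port", "51000"))           # placeholder: a port forwarded through the firewall/router
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())
sc.stop()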


Re: PySpark without PySpark

2015-07-08 Thread Ashish Dutt
Hi Sujit,
Thanks for your response.

So I opened a new notebook using the command ipython notebook --profile
spark and tried the sequence of commands, but I am getting errors. Attached is
a screenshot of the same.
I am also attaching the 00-pyspark-setup.py for your reference. It looks like
I have written something wrong there, but I cannot seem to figure out what it is.

Thank you for your help


Sincerely,
Ashish Dutt
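
The attached 00-pyspark-setup.py is not preserved in the archive; as a rough point
of comparison only, a profile startup script for this kind of setup typically looks
something like the sketch below (the py4j zip name varies with the Spark version):

import os
import sys

spark_home = os.environ.get("SPARK_HOME")
if not spark_home:
    raise ValueError("SPARK_HOME environment variable is not set")

# Put PySpark and the bundled py4j on the Python path.
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))

# Run the PySpark shell bootstrap, which creates the SparkContext `sc` (Python 2 style).
execfile(os.path.join(spark_home, "python", "pyspark", "shell.py"))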

On Thu, Jul 9, 2015 at 11:53 AM, Sujit Pal  wrote:

> Hi Ashish,
>
> >> Nice post.
> Agreed, kudos to the author of the post, Benjamin Benfort of District Labs.
>
> >> Following your post, I get this problem;
> Again, not my post.
>
> I did try setting up IPython with the Spark profile for the edX Intro to
> Spark course (because I didn't want to use the Vagrant container) and it
> worked flawlessly with the instructions provided (on OSX). I haven't used
> the IPython/PySpark environment beyond very basic tasks since then though,
> because my employer has a Databricks license which we were already using
> for other stuff and we ended up doing the labs on Databricks.
>
> Looking at your screenshot though, I don't see why you think its picking
> up the default profile. One simple way of checking to see if things are
> working is to open a new notebook and try this sequence of commands:
>
> from pyspark import SparkContext
> sc = SparkContext("local", "pyspark")
> sc
>
> You should see something like this after a little while:
> 
>
> While the context is being instantiated, you should also see lots of log
> lines scroll by on the terminal where you started the "ipython notebook
> --profile spark" command - these log lines are from Spark.
>
> Hope this helps,
> Sujit
>
>
> On Wed, Jul 8, 2015 at 6:04 PM, Ashish Dutt 
> wrote:
>
>> Hi Sujit,
>> Nice post.. Exactly what I had been looking for.
>> I am relatively a beginner with Spark and real time data processing.
>> We have a server with CDH5.4 with 4 nodes. The spark version in our
>> server is 1.3.0
>> On my laptop I have spark 1.3.0 too and its using Windows 7 environment.
>> As per point 5 of your post I am able to invoke pyspark locally as in a
>> standalone mode.
>>
>> Following your post, I get this problem;
>>
>> 1. In section "Using Ipython notebook with spark" I cannot understand why
>> it is picking up the default profile and not the pyspark profile. I am sure
>> it is because of the path variables. Attached is the screenshot. Can you
>> suggest how to solve this.
>>
>> Current the path variables for my laptop are like
>> SPARK_HOME="C:\SPARK-1.3.0\BIN", JAVA_HOME="C:\PROGRAM
>> FILES\JAVA\JDK1.7.0_79", HADOOP_HOME="D:\WINUTILS", M2_HOME="D:\MAVEN\BIN",
>> MAVEN_HOME="D:\MAVEN\BIN", PYTHON_HOME="C:\PYTHON27\", SBT_HOME="C:\SBT\"
>>
>>
>> Sincerely,
>> Ashish Dutt
>> PhD Candidate
>> Department of Information Systems
>> University of Malaya, Lembah Pantai,
>> 50603 Kuala Lumpur, Malaysia
>>
>> On Thu, Jul 9, 2015 at 4:56 AM, Sujit Pal  wrote:
>>
>>> You are welcome Davies. Just to clarify, I didn't write the post (not
>>> sure if my earlier post gave that impression, apologize if so), although I
>>> agree its great :-).
>>>
>>> -sujit
>>>
>>>
>>> On Wed, Jul 8, 2015 at 10:36 AM, Davies Liu 
>>> wrote:
>>>
>>>> Great post, thanks for sharing with us!
>>>>
>>>> On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal 
>>>> wrote:
>>>> > Hi Julian,
>>>> >
>>>> > I recently built a Python+Spark application to do search relevance
>>>> > analytics. I use spark-submit to submit PySpark jobs to a Spark
>>>> cluster on
>>>> > EC2 (so I don't use the PySpark shell, hopefully thats what you are
>>>> looking
>>>> > for). Can't share the code, but the basic approach is covered in this
>>>> blog
>>>> > post - scroll down to the section "Writing a Spark Application".
>>>> >
>>>> >
>>>> https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
>>>> >
>>>> > Hope this helps,
>>>> >
>>>> > -sujit
>>>> >
>>>> >
>>>> > On Wed, Jul 8, 2015 at 7:46 AM, Julian 
>>>> wrote:
>>>> >>
>>>> >> Hey.
>>>> >>
>>>> >>

DLL load failed: %1 is not a valid win32 application on invoking pyspark

2015-07-08 Thread Ashish Dutt
Hi,

I get the error, "DLL load failed: %1 is not a valid win32 application"
whenever I invoke pyspark. Attached is the screenshot of the same.
Is there any way I can get rid of it? Being new to PySpark, I have had a
not-so-pleasant experience so far, most probably because I am on a Windows
environment.
I am afraid that this error might cause me trouble as I continue exploring
PySpark.
I have already checked Stack Overflow and this user list, but there are no
posts about the same issue.

My environment: Python 2.7, OS: Windows 7, Spark 1.3.0.
Appreciate your help.


Sincerely,
Ashish Dutt

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
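
On Windows, "DLL load failed: %1 is not a valid win32 application" usually points to
a 32-bit/64-bit mismatch between Python, the JVM, and Hadoop's winutils/native DLLs;
that is only an assumption here, not a confirmed diagnosis for this report. A quick
sketch to check the bitness of each:

import platform
import struct
import subprocess

# Report whether this Python interpreter is 32- or 64-bit.
print("Python: %d-bit, version %s" % (struct.calcsize("P") * 8, platform.python_version()))
print("OS: %s" % platform.platform())

# Java prints its bitness in the version banner (e.g. "64-Bit Server VM").
subprocess.call(["java", "-version"])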

Re: PySpark without PySpark

2015-07-08 Thread Ashish Dutt
Hi Sujit,
Nice post. Exactly what I had been looking for.
I am relatively new to Spark and real-time data processing.
We have a server running CDH 5.4 with 4 nodes. The Spark version on our server
is 1.3.0.
On my laptop I have Spark 1.3.0 too, in a Windows 7 environment. As per point 5
of your post, I am able to invoke pyspark locally in standalone mode.

Following your post, I run into this problem:

1. In the section "Using IPython notebook with Spark", I cannot understand why
it is picking up the default profile and not the pyspark profile. I am sure
it is because of the path variables. Attached is a screenshot. Can you
suggest how to solve this?

Currently the path variables on my laptop are:
SPARK_HOME="C:\SPARK-1.3.0\BIN", JAVA_HOME="C:\PROGRAM
FILES\JAVA\JDK1.7.0_79", HADOOP_HOME="D:\WINUTILS", M2_HOME="D:\MAVEN\BIN",
MAVEN_HOME="D:\MAVEN\BIN", PYTHON_HOME="C:\PYTHON27\", SBT_HOME="C:\SBT\"


Sincerely,
Ashish Dutt
PhD Candidate
Department of Information Systems
University of Malaya, Lembah Pantai,
50603 Kuala Lumpur, Malaysia

On Thu, Jul 9, 2015 at 4:56 AM, Sujit Pal  wrote:

> You are welcome Davies. Just to clarify, I didn't write the post (not sure
> if my earlier post gave that impression, apologize if so), although I agree
> its great :-).
>
> -sujit
>
>
> On Wed, Jul 8, 2015 at 10:36 AM, Davies Liu  wrote:
>
>> Great post, thanks for sharing with us!
>>
>> On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal  wrote:
>> > Hi Julian,
>> >
>> > I recently built a Python+Spark application to do search relevance
>> > analytics. I use spark-submit to submit PySpark jobs to a Spark cluster
>> on
>> > EC2 (so I don't use the PySpark shell, hopefully thats what you are
>> looking
>> > for). Can't share the code, but the basic approach is covered in this
>> blog
>> > post - scroll down to the section "Writing a Spark Application".
>> >
>> >
>> https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
>> >
>> > Hope this helps,
>> >
>> > -sujit
>> >
>> >
>> > On Wed, Jul 8, 2015 at 7:46 AM, Julian 
>> wrote:
>> >>
>> >> Hey.
>> >>
>> >> Is there a resource that has written up what the necessary steps are
>> for
>> >> running PySpark without using the PySpark shell?
>> >>
>> >> I can reverse engineer (by following the tracebacks and reading the
>> shell
>> >> source) what the relevant Java imports needed are, but I would assume
>> >> someone has attempted this before and just published something I can
>> >> either
>> >> follow or install? If not, I have something that pretty much works and
>> can
>> >> publish it, but I'm not a heavy Spark user, so there may be some things
>> >> I've
>> >> left out that I haven't hit because of how little of pyspark I'm
>> playing
>> >> with.
>> >>
>> >> Thanks,
>> >> Julian
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-without-PySpark-tp23719.html
>> >> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Connecting to nodes on cluster

2015-07-08 Thread Ashish Dutt
The error is JVM has not responded after 10 seconds.
On 08-Jul-2015 10:54 PM, "ayan guha"  wrote:

> What's the error you are getting?
> On 9 Jul 2015 00:01, "Ashish Dutt"  wrote:
>
>> Hi,
>>
>> We have a cluster with 4 nodes. The cluster uses CDH 5.4 for the past two
>> days I have been trying to connect my laptop to the server using spark
>>  but its been unsucessful.
>> The server contains data that needs to be cleaned and analysed.
>> The cluster and the nodes are on linux environment.
>> To connect to the nodes I am usnig SSH
>>
>> Question: Would it be better if I work directly on the nodes rather than
>> trying to connect my laptop to them ?
>> Question 2: If yes, then can you suggest any python and R IDE that I can
>> install on the nodes to make it work?
>>
>> Thanks for your help
>>
>>
>> Sincerely,
>> Ashish Dutt
>>
>>


Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread Ashish Dutt
Hello Sooraj,
Thank you for your response. It indeed gives me a ray of hope.
Can you please suggest any good tutorials for installing and working with an
IPython notebook server on the node?
Thank you
Ashish
On 08-Jul-2015 6:16 PM, "sooraj"  wrote:
>
> Hi Ashish,
>
> I am running ipython notebook server on one of the nodes of the cluster
(HDP). Setting it up was quite straightforward, and I guess I followed the
same references that you linked to. Then I access the notebook remotely
from my development PC. Never tried to connect a local ipython (on a PC) to
a remote Spark cluster. Not sure if that is possible.
>
> - Sooraj
>
> On 8 July 2015 at 15:31, Ashish Dutt  wrote:
>>
>> My apologies for double posting but I missed the web links that i
followed which are 1, 2, 3
>>
>> Thanks,
>> Ashish
>>
>> On Wed, Jul 8, 2015 at 5:49 PM, sooraj  wrote:
>>>
>>> That turned out to be a silly data type mistake. At one point in the
iterative call, I was passing an integer value for the parameter 'alpha' of
the ALS train API, which was expecting a Double. So, py4j in fact
complained that it cannot take a method that takes an integer value for
that parameter.
>>>
>>> On 8 July 2015 at 12:35, sooraj  wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am using MLlib collaborative filtering API on an implicit preference
data set. From a pySpark notebook, I am iteratively creating the matrix
factorization model with the aim of measuring the RMSE for each combination
of parameters for this API like the rank, lambda and alpha. After the code
successfully completed six iterations, on the seventh call of the
ALS.trainImplicit API, I get a confusing exception that says py4j cannot
find the method trainImplicitALSmodel.  The full trace is included at the
end of the email.
>>>>
>>>> I am running Spark over YARN (yarn-client mode) with five executors.
This error seems to be happening completely on the driver as I don't see
any error on the Spark web interface. I have tried changing the
spark.yarn.am.memory configuration value, but it doesn't help. Any
suggestion on how to debug this will be very helpful.
>>>>
>>>> Thank you,
>>>> Sooraj
>>>>
>>>> Here is the full error trace:
>>>>
>>>>
---
>>>> Py4JError Traceback (most recent call
last)
>>>>  in ()
>>>>   3
>>>>   4 for index, (r, l, a, i) in enumerate(itertools.product(ranks,
lambdas, alphas, iters)):
>>>> > 5 model = ALS.trainImplicit(scoreTableTrain, rank = r,
iterations = i, lambda_ = l, alpha = a)
>>>>   6
>>>>   7 predictionsTrain = model.predictAll(userProductTrainRDD)
>>>>
>>>>
/usr/local/spark-1.4/spark-1.4.0-bin-hadoop2.6/python/pyspark/mllib/recommendation.pyc
in trainImplicit(cls, ratings, rank, iterations, lambda_, blocks, alpha,
nonnegative, seed)
>>>> 198   nonnegative=False, seed=None):
>>>> 199 model = callMLlibFunc("trainImplicitALSModel",
cls._prepare(ratings), rank,
>>>> --> 200   iterations, lambda_, blocks,
alpha, nonnegative, seed)
>>>> 201 return MatrixFactorizationModel(model)
>>>> 202
>>>>
>>>>
/usr/local/spark-1.4/spark-1.4.0-bin-hadoop2.6/python/pyspark/mllib/common.pyc
in callMLlibFunc(name, *args)
>>>> 126 sc = SparkContext._active_spark_context
>>>> 127 api = getattr(sc._jvm.PythonMLLibAPI(), name)
>>>> --> 128 return callJavaFunc(sc, api, *args)
>>>> 129
>>>> 130
>>>>
>>>>
/usr/local/spark-1.4/spark-1.4.0-bin-hadoop2.6/python/pyspark/mllib/common.pyc
in callJavaFunc(sc, func, *args)
>>>> 119 """ Call Java Function """
>>>> 120 args = [_py2java(sc, a) for a in args]
>>>> --> 121 return _java2py(sc, func(*args))
>>>> 122
>>>> 123
>>>>
>>>> /usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in
__call__(self, *args)
>>>> 536 answer = self.gateway_client.send_command(command)
>>>> 537 return_value = get_return_value(answer,
self.gateway_client,
>>>> --> 538 self.target_id, self.name)
>>>> 539
>>>> 540 for temp_arg in temp_args:
>>>>
>>>> /usr/local/lib/python2.7/site-packag
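
The resolution quoted above (an integer passed where ALS expects a Double) can be
avoided by passing float literals for the lambda_ and alpha parameters; a tiny
sketch with made-up data and parameter values:

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local[2]", "als-implicit-types")

# Made-up implicit-preference data: (user, product, confidence).
ratings = sc.parallelize([Rating(1, 10, 1.0), Rating(1, 20, 3.0), Rating(2, 10, 2.0)])

model = ALS.trainImplicit(ratings,
                          rank=10,
                          iterations=5,
                          lambda_=0.01,   # Double parameter: pass a float
                          alpha=40.0)     # Double parameter: an int here caused the py4j lookup failure
print(model.predict(2, 20))
sc.stop()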

Connecting to nodes on cluster

2015-07-08 Thread Ashish Dutt
Hi,

We have a cluster with 4 nodes running CDH 5.4. For the past two days I have
been trying to connect my laptop to the server using Spark, but it has been
unsuccessful.
The server contains data that needs to be cleaned and analysed.
The cluster and the nodes are on a Linux environment.
To connect to the nodes I am using SSH.

Question 1: Would it be better to work directly on the nodes rather than
trying to connect my laptop to them?
Question 2: If yes, then can you suggest any Python and R IDEs that I can
install on the nodes to make it work?

Thanks for your help


Sincerely,
Ashish Dutt


Re: Getting started with spark-scala development in eclipse.

2015-07-08 Thread Ashish Dutt
Hello Prateek,
I started by getting the pre-built binaries so as to skip the hassle of
building them from scratch.
I am not familiar with Scala, so I can't comment on it.
I have documented my experiences on my blog, www.edumine.wordpress.com.
Perhaps it might be useful to you.
 On 08-Jul-2015 9:39 PM, "Prateek ."  wrote:

>  Hi
>
>
>
> I am beginner to scala and spark. I am trying to set up eclipse
> environment to develop spark program  in scala, then take it’s  jar  for
> spark-submit.
>
> How shall I start? To start my  task includes, setting up eclipse for
> scala and spark, getting dependencies resolved, building project using
> maven/sbt.
>
> Is there any good blog or documentation that is can follow.
>
>
>
> Thanks
>
>  "DISCLAIMER: This message is proprietary to Aricent and is intended
> solely for the use of the individual to whom it is addressed. It may
> contain privileged or confidential information and should not be circulated
> or used for any purpose other than for what it is intended. If you have
> received this message in error, please notify the originator immediately.
> If you are not the intended recipient, you are notified that you are
> strictly prohibited from using, copying, altering, or disclosing the
> contents of this message. Aricent accepts no responsibility for loss or
> damage arising from the use of the information transmitted by this email
> including damage from virus."
>

