Re: mesos or kubernetes ?

2016-08-14 Thread Gurvinder Singh
On 08/13/2016 08:24 PM, guyoh wrote:
> My company is trying to decide whether to use kubernetes or mesos. Since we
> are planning to use Spark in the near future, I was wondering what is the
> best choice for us. 
> Thanks, 
> Guy
> 
Both Kubernetes and Mesos enable you to share your infrastructure with
other workloads. On K8s, Spark mainly runs in standalone mode, and to
provide failure recovery you can use the options mentioned here:
https://spark.apache.org/docs/latest/spark-standalone.html#high-availability
I have tested file system recovery with K8s and it works fine, as K8s
restarts the master within a few seconds. On K8s you can also scale
your cluster dynamically, as K8s supports horizontal pod autoscaling
http://blog.kubernetes.io/2016/03/using-Spark-and-Zeppelin-to-process-Big-Data-on-Kubernetes.html
which lets you spin down your Spark cluster when it is not in use, run
other workloads in docker/rkt containers on the freed resources, and
scale the Spark cluster back up when needed, given you have free
resources. You don't need static partitioning for your Spark cluster
when running on K8s, as Spark runs in containers and shares the same
underlying resources with everything else.
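
For reference, the single-node file system recovery mentioned above is
enabled through two properties from the linked standalone HA docs,
passed to the master via SPARK_DAEMON_JAVA_OPTS (the recovery directory
below is illustrative):

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
  -Dspark.deploy.recoveryDirectory=/spark-recovery"

On K8s you would typically put the recovery directory on a persistent
volume so that it survives the master pod being rescheduled.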

You will soonish (https://github.com/apache/spark/pull/13950) be able to
use authentication based on GitHub, Google, Facebook, etc. to access the
Spark master UI when deployed standalone, e.g. on K8s, and only expose
the master UI to reach all the other UIs (worker logs, app UI).

As usual, the choice is yours; I just wanted to give you the info from
the K8s side.

- Gurvinder



spark master ui to proxy app and worker ui

2016-03-03 Thread Gurvinder Singh
Hi,

I am wondering if it is possible for the Spark standalone master UI to
proxy the app/driver UI and worker UI. The reason for this is that
currently, if you want to access the UI of a driver or worker to see
logs, you need access to its IP:port, which makes it harder to open up
from a networking point of view. So operationally it makes life easier
if the master can simply proxy those connections and give access to
both the app and worker UI details from the master UI itself.

The master does not need to have content streamed to it all the time;
only when a user wants to access content from the other UIs does it
proxy the request/response, and only for that duration. Thus the master
will not incur extra load all the time.
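
As a pointer for readers of the archive: this behaviour later landed in
Spark 2.1+ behind a configuration flag; a minimal sketch of the
spark-defaults.conf entries (the URL is illustrative):

spark.ui.reverseProxy      true
spark.ui.reverseProxyUrl   http://master.example.com:8080

With this enabled, worker and application UIs are reachable through the
master UI, so only the master's port needs to be exposed.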

Thanks,
Gurvinder




Re: Tungsten and Spark Streaming

2015-09-10 Thread Gurvinder Singh
On 09/10/2015 07:42 AM, Tathagata Das wrote:
> Rewriting is necessary. You will have to convert RDD/DStream operations
> to DataFrame operations: get the RDDs in the DStream using
> transform/foreachRDD, convert them to DataFrames, and then do DataFrame
> operations.

Are there any plans for 1.6 or later to add Tungsten support to
RDD/DStream directly, or is the intention that users should switch to
DataFrames rather than operating at the RDD/DStream level?
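
For readers of the archive, here is a minimal sketch of the rewrite TD
describes above (Spark 1.5-era APIs; the stream source, schema, and
aggregation are illustrative):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.StreamingContext

case class Event(user: String, count: Int)

def wire(ssc: StreamingContext): Unit = {
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.foreachRDD { rdd =>
    // Convert each micro-batch RDD to a DataFrame so the aggregation
    // runs through the Tungsten-optimized DataFrame code path.
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    val df = rdd.map(_.split(","))
                .map(a => Event(a(0), a(1).trim.toInt))
                .toDF()
    df.groupBy("user").count().show()
  }
}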

> 
> On Wed, Sep 9, 2015 at 9:23 PM, N B wrote:
> 
> Hello,
> 
> How can we start taking advantage of the performance gains made
> under Project Tungsten in Spark 1.5 for a Spark Streaming program? 
> 
> From what I understand, this is available by default for DataFrames.
> But for a program written using Spark Streaming, would we see any
> potential gains "out of the box" in 1.5 or will we have to rewrite
> some portions of the application code to realize that benefit?
> 
> Any insight/documentation links etc. in this regard will be appreciated.
> 
> Thanks
> Nikunj
> 
> 





Re: How to avoid shuffle errors for a large join ?

2015-09-05 Thread Gurvinder Singh
On 09/05/2015 11:22 AM, Reynold Xin wrote:
> Try increasing the shuffle memory fraction (by default it is only 16%).
> Again, if you run Spark 1.5, this will probably run a lot faster,
> especially if you increase the shuffle memory fraction ...
Hi Reynold,

Does 1.5 have better join/cogroup performance for the RDD case too, or
only for SQL?
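
(For context, the knob Reynold refers to is spark.shuffle.memoryFraction;
the effective 16% is its 0.2 default times the 0.8 safety fraction. A
sketch of the spark-defaults.conf change on 1.4, with an illustrative
value:

spark.shuffle.memoryFraction   0.4

If you raise it, consider lowering spark.storage.memoryFraction
(default 0.6) correspondingly, since both fractions are carved out of
the same heap.)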

- Gurvinder
> 
> On Tue, Sep 1, 2015 at 8:13 AM, Thomas Dudziak wrote:
> 
> While it works with sort-merge-join, it takes about 12h to finish
> (with 1 shuffle partitions). My hunch is that the reason for
> that is this:
> 
> INFO ExternalSorter: Thread 3733 spilling in-memory map of 174.9 MB
> to disk (62 times so far)
> 
> (and lots more where this comes from).
> 
> On Sat, Aug 29, 2015 at 7:17 PM, Reynold Xin wrote:
> 
> Can you try 1.5? This should work much, much better in 1.5 out
> of the box.
> 
> For 1.4, I think you'd want to turn on sort-merge-join, which is
> off by default. However, the sort-merge join in 1.4 can still
> trigger a lot of garbage, making it slower. SMJ performance is
> probably 5x - 1000x better in 1.5 for your case.
> 
> 
> On Thu, Aug 27, 2015 at 6:03 PM, Thomas Dudziak wrote:
> 
> I'm getting errors like "Removing executor with no recent
> heartbeats" & "Missing an output location for shuffle"
> errors for a large SparkSql join (1bn rows/2.5TB joined with
> 1bn rows/30GB) and I'm not sure how to configure the job to
> avoid them.
> 
> The initial stage completes fine with some 30k tasks on a
> cluster with 70 machines/10TB memory, generating about 6.5TB
> of shuffle writes, but then the shuffle stage first waits
> 30min in the scheduling phase according to the UI, and then
> dies with the mentioned errors.
> 
> I can see in the GC logs that the executors reach their
> memory limits (32g per executor, 2 workers per machine) and
> can't allocate any more stuff in the heap. Fwiw, the top 10
> in the memory use histogram are:
> 
> num     #instances         #bytes  class name
> ----------------------------------------------
>   1:    249139595    11958700560  scala.collection.immutable.HashMap$HashMap1
>   2:    251085327     8034730464  scala.Tuple2
>   3:    243694737     5848673688  java.lang.Float
>   4:    231198778     5548770672  java.lang.Integer
>   5:     72191585     4298521576  [Lscala.collection.immutable.HashMap;
>   6:     72191582     2310130624  scala.collection.immutable.HashMap$HashTrieMap
>   7:     74114058     1778737392  java.lang.Long
>   8:      6059103      779203840  [Ljava.lang.Object;
>   9:      5461096      174755072  scala.collection.mutable.ArrayBuffer
>  10:        34749       70122104  [B
> 
> Relevant settings are (Spark 1.4.1, Java 8 with G1 GC):
> 
> spark.core.connection.ack.wait.timeout 600
> spark.executor.heartbeatInterval   60s
> spark.executor.memory  32g
> spark.mesos.coarse false
> spark.network.timeout  600s
> spark.shuffle.blockTransferService netty
> spark.shuffle.consolidateFiles true
> spark.shuffle.file.buffer  1m
> spark.shuffle.io.maxRetries6
> spark.shuffle.manager  sort
> 
> The join is currently configured with
> spark.sql.shuffle.partitions=1000, but that doesn't seem to
> help. Would increasing the partitions help? Is there a
> formula to determine an approximate partition count
> for a join?
> Any help with this job would be appreciated!
> 
> cheers,
> Tom
> 
> 
> 
> 





Re: spark and mesos issue

2014-09-15 Thread Gurvinder Singh
It might not be related only to the memory issue, though the memory
issue is there too, as you mentioned; I have seen that one as well. The
fine-grained mode issue is mainly that Spark considers that it got two
different block managers for the same ID, whereas if I search for the
ID on the mesos slaves, it exists only on one slave, not on multiple of
them. This might be due to the size of the ID, as Spark prints the
error as

14/09/16 08:04:29 ERROR BlockManagerMasterActor: Got two different
block manager registrations on 20140822-112818-711206558-5050-25951-0

whereas in the mesos slave logs I see

I0915 20:55:18.293903 31434 containerizer.cpp:392] Starting container
'3aab2237-d32f-470d-a206-7bada454ad3f' for executor
'20140822-112818-711206558-5050-25951-0' of framework
'20140822-112818-711206558-5050-25951-0053'

I0915 20:53:28.039218 31437 containerizer.cpp:392] Starting container
'fe4b344f-16c9-484a-9c2f-92bd92b43f6d' for executor
'20140822-112818-711206558-5050-25951-0' of framework
'20140822-112818-711206558-5050-25951-0050'


Note that the last 3 digits of the ID are missing in the Spark log,
whereas they are different in the mesos slaves.
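
(As a note for readers of the archive: the ticket Brenden mentions
below, SPARK-3535, eventually led to a configurable memory overhead
added on top of the executor heap when sizing the Mesos container; the
property name is from that later fix and the value is illustrative:

spark.mesos.executor.memoryOverhead   512

This keeps non-heap JVM memory from eating into a container sized only
for the heap and getting the executor OOM-killed.)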

- Gurvinder
On 09/15/2014 11:13 PM, Brenden Matthews wrote:
> I started hitting a similar problem, and it seems to be related to 
> memory overhead and tasks getting OOM killed.  I filed a ticket
> here:
> 
> https://issues.apache.org/jira/browse/SPARK-3535
> 
> On Wed, Jul 16, 2014 at 5:27 AM, Ray Rodriguez wrote:
> 
> I'll set some time aside today to gather and post some logs and 
> details about this issue from our end.
> 
> 
> On Wed, Jul 16, 2014 at 2:05 AM, Vinod Kone wrote:
> 
> 
> 
> 
> On Tue, Jul 15, 2014 at 11:02 PM, Vinod Kone wrote:
> 
> 
> On Fri, Jul 4, 2014 at 2:05 AM, Gurvinder Singh wrote:
> 
> ERROR storage.BlockManagerMasterActor: Got two different block
> manager registrations on 201407031041-1227224054-5050-24004-0
> 
> Googling about it suggests that mesos is starting slaves at the same
> time and giving them the same id. So maybe a bug in mesos?
> 
> 
> Has this issue been resolved? We need more information to triage
> this. Maybe some logs that show the lifecycle of the duplicate
> instances?
> 
> 
> @vinodkone
> 
> 
> 
> 





Re: Starting Thriftserver via hostname on Spark 1.1 RC4?

2014-09-04 Thread Gurvinder Singh
I want to add that there is a regression when using pyspark to read
data from HDFS: its performance during map tasks has dropped to roughly
half (1x -> 0.5x). I have tested 1.0.2 and the performance was fine, but
the 1.1 release candidate has this issue. I tested with the following
properties set, to make sure it was not due to these:

set("spark.io.compression.codec","lzf").set("spark.shuffle.spill","false")

on the conf object. Let me know if you need further information.
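
For completeness, here is the conf object spelled out, in Scala for
illustration (the pyspark SparkConf API mirrors it; the app name is
made up):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hdfs-read-test")
  .set("spark.io.compression.codec", "lzf")
  .set("spark.shuffle.spill", "false")
val sc = new SparkContext(conf)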

Regards,
Gurvinder
On 09/04/2014 07:47 AM, Denny Lee wrote:
> When I start the thrift server (on Spark 1.1 RC4) via:
> ./sbin/start-thriftserver.sh --master spark://hostname:7077
> --driver-class-path $CLASSPATH
> 
> It appears that the thrift server is starting off of localhost as
> opposed to the hostname. I have set spark-env.sh to use the hostname,
> modified /etc/hosts for the hostname, and it appears to work properly.
> 
> But when I start the thrift server, connectivity can only be via
> localhost:1 as opposed to hostname:1.
> 
> Any ideas on what configurations I may be setting incorrectly here?
> 
> Thanks!
> Denny
> 





Re: hdfs read performance issue

2014-08-20 Thread Gurvinder Singh
I got some time to look into it. It appears that Spark (latest git) is
doing this operation much more often compared to the Aug 1 version.
Here is the log from the operation I am referring to:

14/08/19 12:37:26 INFO spark.CacheManager: Partition rdd_8_414 not
found, computing it
14/08/19 12:37:26 INFO rdd.HadoopRDD: Input split:
hdfs://test/test_flows/test-2014-05-06.csv:9529458688+134217728
14/08/19 12:37:41 INFO python.PythonRDD: Times: total = 16312, boot = 8,
init = 134, finish = 16170
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[dstip]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@6374d682,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[dstport]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@baf23d1,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[srcport]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@17587455,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[stime]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@303d846c,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[endtime]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@16c0e732,
ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for
[srcip]:
org.apache.spark.sql.columnar.compression.PassThrough$Encoder@528a8f49,
ratio: 1.0
14/08/19 12:37:41 INFO storage.MemoryStore: ensureFreeSpace(64834432)
called with curMem=1556288334, maxMem=944671

In the Aug 1 version the log file from processing the same amount of
data is about 136 KB, whereas with latest git it is 23 MB. The only
message that makes the log file grow is the one mentioned above. It
appears the latest git version has an issue when reading data and
converting it into the columnar format. This conversion happens when
Spark creates an RDD, once per RDD; the latest git version might simply
be doing it for each record in the RDD. That is what makes reads from
disk slow, as the time goes into this operation.

Any suggestions or help in this regard would be appreciated.

- Gurvinder
On 08/14/2014 10:27 AM, Gurvinder Singh wrote:
> Hi,
> 
> I am running spark from git directly. I recently compiled the newer
> Aug 13 version and it has a performance drop of 2-3x in reads from
> HDFS compared to the Aug 1 git version. So I am wondering which commit
> could have caused such a regression in read performance. The
> performance is almost the same once data is cached in memory, but
> reads from HDFS are much slower than with the Aug 1 version.
> 
> - Gurvinder
> 





read performance issue

2014-08-14 Thread Gurvinder Singh
Hi,

I am running spark from git directly. I recently compiled the newer
Aug 13 version and it has a performance drop of 2-3x in reads from HDFS
compared to the Aug 1 git version. So I am wondering which commit could
have caused such a regression in read performance. The performance is
almost the same once data is cached in memory, but reads from HDFS are
much slower than with the Aug 1 version.

- Gurvinder




Re: SQLCtx cacheTable

2014-08-04 Thread Gurvinder Singh
On 08/04/2014 10:57 PM, Michael Armbrust wrote:
> If mesos is allocating a container that is exactly the same as the max
> heap size then that is leaving no buffer space for non-heap JVM memory,
> which seems wrong to me.
> 
This can be a cause. I am now wondering how mesos picks up the size and
sets the -Xmx parameter.
> The problem here is that cacheTable is more aggressive about grabbing
> large ByteBuffers during caching (which it later releases when it knows
> the exact size of the data). There is a discussion here about trying to
> improve this: https://issues.apache.org/jira/browse/SPARK-2650
> 
I am not sure this issue is the one causing the problem for us, as we
have approx 60GB of cached data, whereas each executor has 17GB of
memory and there are 15 of them, so 255GB in total, which is way more
than the 60GB of cached data.

Any suggestions on where to look for changing the mesos settings in
this case?

- Gurvinder
> 
> On Sun, Aug 3, 2014 at 11:35 PM, Gurvinder Singh wrote:
> 
> On 08/03/2014 02:33 AM, Michael Armbrust wrote:
> > I am not a mesos expert... but it sounds like there is some mismatch
> > between the size that mesos is giving you and the maximum heap size of
> > the executors (-Xmx).
> >
> It seems that mesos is giving the correct size to the java process; it
> has the exact size set in the -Xms/-Xmx params. Do you know if I can
> somehow find out which class or thread inside the spark JVM process is
> using how much memory, and see what makes it reach the memory limit in
> the cacheTable case but not in the cached RDD case?
> 
> - Gurvinder
> >
> > On Fri, Aug 1, 2014 at 12:07 AM, Gurvinder Singh wrote:
> >
> > It is not getting an out of memory exception. I am using Mesos as
> > cluster manager, and when I use cacheTable it says that the container
> > has used all of its allocated memory and thus kills it. I can see it
> > in the logs on the mesos-slave where the executor runs. But the web
> > UI of the spark application shows that it still has 4-5GB of space
> > left for caching/storing. So I am wondering how the memory is handled
> > in the cacheTable case. Does it reserve the memory for storage so
> > that other parts run out of their memory? I also tried changing
> > "spark.storage.memoryFraction" but that did not help.
> >
> > - Gurvinder
> > On 08/01/2014 08:42 AM, Michael Armbrust wrote:
> > > Are you getting OutOfMemoryExceptions with cacheTable? Or what do
> > > you mean when you say you have to specify larger executor memory?
> > > You might be running into SPARK-2650
> > > <https://issues.apache.org/jira/browse/SPARK-2650>.
> > >
> > > Is there something else you are trying to accomplish by setting
> > > the persistence level? If you are looking for something like
> > > DISK_ONLY, you can simulate that now using saveAsParquetFile and
> > > parquetFile.
> > >
> > > It is possible long term that we will automatically map the
> > > standard RDD persistence levels to these more efficient
> > > implementations in the future.
> > >
> > >
> > > On Thu, Jul 31, 2014 at 11:26 PM, Gurvinder Singh wrote:
> > >
> > > Thanks Michael for the explanation. Actually I tried caching the
> > > RDD and making a table on it. But the performance of cacheTable
> > > was 3X better than caching the RDD. Now I know why it is better.
> > > But is it possible to add support for the persistence level to
> > > cacheTable itself, like RDD? Maybe it is not related, but on the
> > > same size of data set, when I use
> > > cache

Re: SQLCtx cacheTable

2014-08-03 Thread Gurvinder Singh
On 08/03/2014 02:33 AM, Michael Armbrust wrote:
> I am not a mesos expert... but it sounds like there is some mismatch
> between the size that mesos is giving you and the maximum heap size of
> the executors (-Xmx).
> 
It seems that mesos is giving the correct size to the java process; it
has the exact size set in the -Xms/-Xmx params. Do you know if I can
somehow find out which class or thread inside the spark JVM process is
using how much memory, and see what makes it reach the memory limit in
the cacheTable case but not in the cached RDD case?

- Gurvinder
> 
> On Fri, Aug 1, 2014 at 12:07 AM, Gurvinder Singh wrote:
> 
> It is not getting an out of memory exception. I am using Mesos as
> cluster manager, and when I use cacheTable it says that the container
> has used all of its allocated memory and thus kills it. I can see it in
> the logs on the mesos-slave where the executor runs. But the web UI of
> the spark application shows that it still has 4-5GB of space left for
> caching/storing. So I am wondering how the memory is handled in the
> cacheTable case. Does it reserve the memory for storage so that other
> parts run out of their memory? I also tried changing
> "spark.storage.memoryFraction" but that did not help.
> 
> - Gurvinder
> On 08/01/2014 08:42 AM, Michael Armbrust wrote:
> > Are you getting OutOfMemoryExceptions with cacheTable? Or what do you
> > mean when you say you have to specify larger executor memory? You
> > might be running into SPARK-2650
> > <https://issues.apache.org/jira/browse/SPARK-2650>.
> >
> > Is there something else you are trying to accomplish by setting the
> > persistence level? If you are looking for something like DISK_ONLY,
> > you can simulate that now using saveAsParquetFile and parquetFile.
> >
> > It is possible long term that we will automatically map the standard
> > RDD persistence levels to these more efficient implementations in the
> > future.
> >
> >
> > On Thu, Jul 31, 2014 at 11:26 PM, Gurvinder Singh wrote:
> >
> > Thanks Michael for the explanation. Actually I tried caching the RDD
> > and making a table on it. But the performance of cacheTable was 3X
> > better than caching the RDD. Now I know why it is better. But is it
> > possible to add support for the persistence level to cacheTable
> > itself, like RDD? Maybe it is not related, but on the same size of
> > data set, when I use cacheTable I have to specify larger executor
> > memory than I need in the case of caching the RDD. Although in the
> > storage tab on the status web UI the memory footprint is almost the
> > same: 58.3 GB with cacheTable and 59.7 GB with the cached RDD. Is it
> > possible that there is some memory leak, or does cacheTable work
> > differently and thus require more memory? The difference is 5GB per
> > executor for a dataset of size 122 GB.
> >
> > Thanks,
> > Gurvinder
> > On 08/01/2014 04:42 AM, Michael Armbrust wrote:
> > > cacheTable uses a special columnar caching technique that is
> > > optimized for SchemaRDDs. It is something similar to MEMORY_ONLY_SER
> > > but not quite. You can specify the persistence level on the
> > > SchemaRDD itself and register that as a temporary table; however, it
> > > is likely you will not get as good performance.
> > >
> > >
> > > On Thu, Jul 31, 2014 at 6:16 AM, Gurvinder Singh wrote:
> > >
> > > Hi,
> > >
> > > I am wondering how I can specify the persistence level in
> > > cacheTable, as it takes only the table name as a parameter. It
> > > should be possible to specify the persistence level.
> > >
> > > - Gurvinder
> > >
> > >
> >
> >
> 
> 





Re: SQLCtx cacheTable

2014-07-31 Thread Gurvinder Singh
Thanks Michael for the explanation. Actually I tried caching the RDD and
making a table on it. But the performance of cacheTable was 3X better
than caching the RDD. Now I know why it is better. But is it possible to
add support for the persistence level to cacheTable itself, like RDD?
Maybe it is not related, but on the same size of data set, when I use
cacheTable I have to specify larger executor memory than I need in the
case of caching the RDD. Although in the storage tab on the status web
UI the memory footprint is almost the same: 58.3 GB with cacheTable and
59.7 GB with the cached RDD. Is it possible that there is some memory
leak, or does cacheTable work differently and thus require more memory?
The difference is 5GB per executor for a dataset of size 122 GB.

Thanks,
Gurvinder
On 08/01/2014 04:42 AM, Michael Armbrust wrote:
> cacheTable uses a special columnar caching technique that is
> optimized for SchemaRDDs. It is something similar to MEMORY_ONLY_SER
> but not quite. You can specify the persistence level on the
> SchemaRDD itself and register that as a temporary table; however, it
> is likely you will not get as good performance.
> 
> 
> On Thu, Jul 31, 2014 at 6:16 AM, Gurvinder Singh wrote:
> 
> Hi,
> 
> I am wondering how I can specify the persistence level in cacheTable,
> as it takes only the table name as a parameter. It should be possible
> to specify the persistence level.
> 
> - Gurvinder
> 
> 
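
A minimal, untested sketch of the two options Michael describes above,
against the Spark 1.0-era shell API (the data file, case class, and
table names are illustrative):

import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// Option 1: the special columnar in-memory cache; no choice of
// storage level.
people.registerAsTable("people")
sqlContext.cacheTable("people")

// Option 2: pick a storage level on the SchemaRDD itself and register
// it as a table, likely at some cost in performance compared to the
// columnar cache.
val serialized = sqlContext.sql("SELECT * FROM people")
serialized.persist(StorageLevel.MEMORY_ONLY_SER).registerAsTable("people_ser")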



SQLCtx cacheTable

2014-07-31 Thread Gurvinder Singh
Hi,

I am wondering how I can specify the persistence level in cacheTable, as
it takes only the table name as a parameter. It should be possible to
specify the persistence level.

- Gurvinder


Re: reading compress lzo files

2014-07-05 Thread Gurvinder Singh
On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
> On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh wrote:
> 
> csv = sc.newAPIHadoopFile(opts.input,
>     "com.hadoop.mapreduce.LzoTextInputFormat",
>     "org.apache.hadoop.io.LongWritable",
>     "org.apache.hadoop.io.Text").count()
> 
> Does anyone know what the rough equivalent of this would be in the Scala
> API?
> 
I am not sure; I haven't tested it using scala. The
com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
https://github.com/twitter/hadoop-lzo

I have installed it from the Cloudera "hadoop-lzo" package, with the
liblzo2-2 Debian package on all of my workers. Make sure you have
hadoop-lzo.jar in your classpath for spark.
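
That said, here is an untested sketch of what the Scala call should
look like; unlike the python API, the input format and key/value types
are passed as Class objects rather than strings (the path is
illustrative, and hadoop-lzo.jar must be on the classpath):

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// newAPIHadoopFile takes the path, then the InputFormat class, then
// the key and value classes.
val csv = sc.newAPIHadoopFile(
  "hdfs://test/input.csv.lzo",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text])
csv.count()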

- Gurvinder

> I am trying the following, but the first import yields an error on my
> spark-ec2 cluster:
> 
> import com.hadoop.mapreduce.LzoTextInputFormat
> import org.apache.hadoop.io.LongWritable
> import org.apache.hadoop.io.Text
> 
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>  LzoTextInputFormat, LongWritable, Text)
> 
> scala> import com.hadoop.mapreduce.LzoTextInputFormat
> :12: error: object hadoop is not a member of package com
>import com.hadoop.mapreduce.LzoTextInputFormat
> 
> Nick




Re: reading compress lzo files

2014-07-04 Thread Gurvinder Singh
An update on this issue: spark is now able to read the lzo file,
recognizes that it has an index, and starts multiple map tasks. You need
to use the following function instead of textFile:

csv = sc.newAPIHadoopFile(opts.input,
    "com.hadoop.mapreduce.LzoTextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text").count()

- Gurvinder
On 07/03/2014 06:24 PM, Gurvinder Singh wrote:
> Hi all,
> 
> I am trying to read lzo files. It seems spark recognizes that the
> input file is compressed and gets the decompressor:
> 
> 14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
> 14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded & initialized
> native-lzo library [hadoop-lzo rev
> ee825cb06b23d3ab97cdd87e13cbbb630bd75b98]
> 14/07/03 18:11:01 INFO Configuration.deprecation: hadoop.native.lib is
> deprecated. Instead, use io.native.lib.available
> 14/07/03 18:11:01 INFO compress.CodecPool: Got brand-new decompressor
> [.lzo]
> 
> But it has two issues:
> 
> 1. It just gets stuck here without doing anything; I waited 15 min
> for a small file.
> 2. I used hadoop-lzo to create the index so that spark can split
> the input into multiple maps, but spark creates only one mapper.
> 
> I am using python, reading with sc.textFile(). The Spark version is
> from the git master.
> 
> Regards,
> Gurvinder
> 



Re: issue with running example code

2014-07-04 Thread Gurvinder Singh
In the end it turned out that the issue was caused by a config setting
in spark-defaults.conf. After removing this setting

spark.files.userClassPathFirst   true

things are back to normal. Just reporting in case someone else hits the
same issue.

- Gurvinder
On 07/03/2014 06:49 PM, Gurvinder Singh wrote:
> Just to provide more information on this issue: it seems that the
> SPARK_HOME environment variable is causing it. If I unset the variable
> in the spark-class script and run in local mode, my code runs fine
> without the exception. But if I run with SPARK_HOME set, I get the
> exception mentioned below. I could run without setting SPARK_HOME, but
> that is not possible in a cluster setting, as this is what tells the
> worker nodes where spark is. E.g. we are using Mesos as the cluster
> manager, so when the master is set to mesos we get an exception if
> SPARK_HOME is not set.
> 
> Just to mention again: pyspark works fine, as well as spark-shell;
> only when we run a compiled jar does SPARK_HOME seem to cause some
> java runtime issue that gives us the class cast exception.
> 
> Thanks,
> Gurvinder
> On 07/01/2014 09:28 AM, Gurvinder Singh wrote:
>> Hi,
>>
>> I am having an issue running the scala example code. I have tested and
>> am able to run the python example code successfully, but when I run the
>> scala code I get this error:
>>
>> java.lang.ClassCastException: cannot assign instance of
>> org.apache.spark.examples.SparkPi$$anonfun$1 to field
>> org.apache.spark.rdd.MappedRDD.f of type scala.Function1 in instance of
>> org.apache.spark.rdd.MappedRDD
>>
>> I have compiled spark from github directly and am running it with the
>> command:
>>
>> spark-submit /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar
>> --class org.apache.spark.examples.SparkPi 5 --jars
>> /usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar
>>
>> Any suggestions will be helpful.
>>
>> Thanks,
>> Gurvinder
>>
> 



spark and mesos issue

2014-07-04 Thread Gurvinder Singh
We are getting this issue when running jobs with close to 1000 workers.
Spark is the github version and mesos is 0.19.0.

ERROR storage.BlockManagerMasterActor: Got two different block manager
registrations on 201407031041-1227224054-5050-24004-0

Googling about it suggests that mesos is starting slaves at the same
time and giving them the same id. So maybe a bug in mesos?

Thanks,
Gurvinder
On 07/04/2014 01:03 AM, Vinod Kone wrote:
> correct url:
> 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
> 
> 
> On Thu, Jul 3, 2014 at 1:40 PM, Vinod Kone wrote:
> 
> Hi,
> 
> We are planning to release 0.19.1 (likely next week) which will be a
> bug fix release. Specifically, these are the fixes that we are
> planning to cherry pick.
> 
> 
> https://issues.apache.org/jira/issues/?filter=12326191&jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
> 
> If there are other critical fixes that need to be backported to
> 0.19.1 please reply here as soon as possible.
> 
> Thanks,
> 
> 



Re: issue with running example code

2014-07-03 Thread Gurvinder Singh
Just to provide more information on this issue: it seems that the
SPARK_HOME environment variable is causing it. If I unset the variable
in the spark-class script and run in local mode, my code runs fine
without the exception. But if I run with SPARK_HOME set, I get the
exception mentioned below. I could run without setting SPARK_HOME, but
that is not possible in a cluster setting, as this is what tells the
worker nodes where spark is. E.g. we are using Mesos as the cluster
manager, so when the master is set to mesos we get an exception if
SPARK_HOME is not set.

Just to mention again: pyspark works fine, as well as spark-shell; only
when we run a compiled jar does SPARK_HOME seem to cause some java
runtime issue that gives us the class cast exception.

Thanks,
Gurvinder
On 07/01/2014 09:28 AM, Gurvinder Singh wrote:
> Hi,
> 
> I am having an issue running the scala example code. I have tested and
> am able to run the python example code successfully, but when I run the
> scala code I get this error:
> 
> java.lang.ClassCastException: cannot assign instance of
> org.apache.spark.examples.SparkPi$$anonfun$1 to field
> org.apache.spark.rdd.MappedRDD.f of type scala.Function1 in instance of
> org.apache.spark.rdd.MappedRDD
> 
> I have compiled spark from github directly and am running it with the
> command:
> 
> spark-submit /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar
> --class org.apache.spark.examples.SparkPi 5 --jars
> /usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar
> 
> Any suggestions will be helpful.
> 
> Thanks,
> Gurvinder
> 



reading compress lzo files

2014-07-03 Thread Gurvinder Singh
Hi all,

I am trying to read lzo files. It seems spark recognizes that the input
file is compressed and gets the decompressor:

14/07/03 18:11:01 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
14/07/03 18:11:01 INFO lzo.LzoCodec: Successfully loaded & initialized
native-lzo library [hadoop-lzo rev
ee825cb06b23d3ab97cdd87e13cbbb630bd75b98]
14/07/03 18:11:01 INFO Configuration.deprecation: hadoop.native.lib is
deprecated. Instead, use io.native.lib.available
14/07/03 18:11:01 INFO compress.CodecPool: Got brand-new decompressor
[.lzo]

But it has two issues:

1. It just gets stuck here without doing anything; I waited 15 min for a
small file.
2. I used hadoop-lzo to create the index so that spark can split the
input into multiple maps, but spark creates only one mapper.

I am using python, reading with sc.textFile(). The Spark version is from
the git master.

Regards,
Gurvinder


issue with running example code

2014-07-01 Thread Gurvinder Singh
Hi,

I am having an issue running the scala example code. I have tested and
am able to run the python example code successfully, but when I run the
scala code I get this error:

java.lang.ClassCastException: cannot assign instance of
org.apache.spark.examples.SparkPi$$anonfun$1 to field
org.apache.spark.rdd.MappedRDD.f of type scala.Function1 in instance of
org.apache.spark.rdd.MappedRDD

I have compiled spark from github directly and am running it with the
command:

spark-submit /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar
--class org.apache.spark.examples.SparkPi 5 --jars
/usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar

Any suggestions will be helpful.

Thanks,
Gurvinder