Re: how to run spark job on yarn with jni lib?

2014-09-26 Thread Marcelo Vanzin
I assume you did those things on all machines, not just on the machine launching the job? I've seen that workaround used successfully (well, actually, they copied the library to /usr/lib or something, but same idea). On Thu, Sep 25, 2014 at 7:45 PM, taqilabon g945...@gmail.com wrote: You're

Re: how to run spark job on yarn with jni lib?

2014-09-25 Thread Marcelo Vanzin
Hmmm, you might be suffering from SPARK-1719. Not sure what the proper workaround is, but it sounds like your native libs are not in any of the standard lib directories; one workaround might be to copy them there, or add their location to /etc/ld.so.conf (I'm assuming Linux). On Thu, Sep 25,

Re: Question About Submit Application

2014-09-25 Thread Marcelo Vanzin
Then I think it's time for you to look at the Spark Master logs... On Thu, Sep 25, 2014 at 7:51 AM, danilopds danilob...@gmail.com wrote: Hi Marcelo, Yes, I can ping spark-01 and I also include the IP and host in my file /etc/hosts. My VM can ping the local machine too. -- View this

Re: Yarn number of containers

2014-09-25 Thread Marcelo Vanzin
On Thu, Sep 25, 2014 at 8:55 AM, jamborta jambo...@gmail.com wrote: I am running spark with the default settings in yarn client mode. For some reason yarn always allocates three containers to the application (wondering where it is set?), and only uses two of them. The default number of

Re: SPARK 1.1.0 on yarn-cluster and external JARs

2014-09-25 Thread Marcelo Vanzin
You can pass the HDFS location of those extra jars in the spark-submit --jars argument. Spark will take care of using Yarn's distributed cache to make them available to the executors. Note that you may need to provide the full hdfs URL (not just the path, since that will be interpreted as a local

Re: Yarn number of containers

2014-09-25 Thread Marcelo Vanzin
Comma separated list of archives to be extracted into the working directory of each executor. On Thu, Sep 25, 2014 at 2:20 PM, Tamas Jambor jambo...@gmail.com wrote: Thank you. Where is the number of containers set? On Thu, Sep 25, 2014 at 7:17 PM, Marcelo Vanzin van

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
You'll need to look at the driver output to have a better idea of what's going on. You can use yarn logs --applicationId blah after your app is finished (e.g. by killing it) to look at it. My guess is that your cluster doesn't have enough resources available to service the container request

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
:37 PM, Marcelo Vanzin van...@cloudera.com wrote: You'll need to look at the driver output to have a better idea of what's going on. You can use yarn logs --applicationId blah after your app is finished (e.g. by killing it) to look at it. My guess is that your cluster doesn't have enough

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
, Sep 25, 2014 at 12:04 AM, Marcelo Vanzin van...@cloudera.com wrote: You need to use the command line yarn application that I mentioned (yarn logs). You can't look at the logs through the UI after the app stops. On Wed, Sep 24, 2014 at 11:16 AM, Raghuveer Chanda raghuveer.cha...@gmail.com wrote

Re: Question About Submit Application

2014-09-24 Thread Marcelo Vanzin
Sounds like spark-01 is not resolving correctly on your machine (or is the wrong address). Can you ping spark-01 and does that reach the VM where you set up the Spark Master? On Wed, Sep 24, 2014 at 1:12 PM, danilopds danilob...@gmail.com wrote: Hello, I'm learning about Spark Streaming and I'm

Re: spark-submit command-line with --files

2014-09-20 Thread Marcelo Vanzin
Hi chinchu, Where does the code trying to read the file run? Is it running on the driver or on some executor? If it's running on the driver, in yarn-cluster mode, the file should have been copied to the application's work directory before the driver is started. So hopefully just doing new
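
For reference, a minimal Scala sketch of reading a file shipped with spark-submit --files from the driver in yarn-cluster mode; the file name my-app.conf is illustrative, not from the original thread:

    // Assumes spark-submit was invoked with:  --files my-app.conf
    // YARN copies the file into the container's working directory before the driver starts.
    import java.io.File
    import scala.io.Source

    val confFile = new File("my-app.conf")          // plain name, resolved against the work dir
    val contents = Source.fromFile(confFile).getLines().mkString("\n")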

Re: Task not serializable

2014-09-10 Thread Marcelo Vanzin
You're using hadoopConf, a Configuration object, in your closure. That type is not serializable. You can use -Dsun.io.serialization.extendedDebugInfo=true to debug serialization issues. On Wed, Sep 10, 2014 at 8:23 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Thanks Sean.
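
As a hedged illustration of the general pattern (not the poster's actual code; rdd and hadoopConf stand for an existing RDD[String] and Hadoop Configuration): referencing the Configuration inside a closure drags it into the serialized task, so pull the needed values into plain local variables first:

    // Problematic: Configuration is not serializable.
    // rdd.map { x => hadoopConf.get("fs.defaultFS") + x }

    // Workaround: capture only what the closure needs as serializable values.
    val defaultFs = hadoopConf.get("fs.defaultFS")   // read once, on the driver
    val result = rdd.map { x => defaultFs + x }      // closure now captures only a String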

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 11:15 PM, Sean Owen so...@cloudera.com wrote: This structure is not specific to Hadoop, but in theory works in any JAR file. You can put JARs in JARs and refer to them with Class-Path entries in META-INF/MANIFEST.MF. Funny that you mention that, since someone internally

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Marcelo Vanzin
On Wed, Sep 10, 2014 at 3:44 PM, Sean Owen so...@cloudera.com wrote: What's the Hadoop jar structure in question then? Is it something special like a WAR file? I confess I had never heard of this so thought this was about generic JAR stuff. What I've been told (and Steve's e-mail alludes to)

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-09 Thread Marcelo Vanzin
Yes, that's how file: URLs are interpreted everywhere in Spark. (It's also explained in the link to the docs I posted earlier.) The second interpretation below is local: URLs in Spark, but that doesn't work with Yarn on Spark 1.0 (so it won't work with CDH 5.1 and older either). On Mon, Sep 8,

Re: spark-streaming Could not compute split exception

2014-09-09 Thread Marcelo Vanzin
This has all the symptoms of Yarn killing your executors due to them exceeding their memory limits. Could you check your RM/NM logs to see if that's the case? (The error was because of an executor at domU-12-31-39-0B-F1-D1.compute-1.internal, so you can check that NM's log file.) If that's the

Re: Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread Marcelo Vanzin
Hi, Yes, this is a problem, and I'm not aware of any simple workarounds (or complex one for that matter). There are people working to fix this, you can follow progress here: https://issues.apache.org/jira/browse/SPARK-1239 On Tue, Sep 9, 2014 at 2:54 PM, jbeynon jbey...@gmail.com wrote: I'm

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 9:35 AM, Dimension Data, LLC. subscripti...@didata.us wrote: user$ pyspark [some-options] --driver-java-options spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar This command line does not look correct. spark.yarn.jar is not a JVM command line option.
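
For illustration only (not necessarily the exact fix discussed in the thread): spark.yarn.jar is a Spark configuration property rather than a JVM option, so it can be set like any other property, for example in spark-defaults.conf or programmatically in yarn-client mode before the SparkContext is created. A minimal Scala sketch with a hypothetical HDFS path:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("example")
      .setMaster("yarn-client")
      // Point the AM/executors at an assembly already uploaded to HDFS (path is illustrative).
      .set("spark.yarn.jar", "hdfs://namenode:8020/user/spark/share/lib/spark-assembly.jar")
    val sc = new SparkContext(conf)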

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 10:00 AM, Dimension Data, LLC. subscripti...@didata.us wrote: user$ export MASTER=local[nn] # Run spark shell on LOCAL CPU threads. user$ pyspark [someOptions] --driver-java-options -Dspark.*XYZ*.jar=' /usr/lib/spark/assembly/lib/spark-assembly-*.jar' My question is,

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 11:52 AM, Dimension Data, LLC. subscripti...@didata.us wrote: So just to clarify for me: When specifying 'spark.yarn.jar' as I did above, even if I don't use HDFS to create a RDD (e.g. do something simple like: 'sc.parallelize(range(100))'), it is still necessary to

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 3:54 PM, Dimension Data, LLC. subscripti...@didata.us wrote: You're probably right about the above because, as seen *below* for pyspark (but probably for other Spark applications too), once '-Dspark.master=[yarn-client|yarn-cluster]' is specified, the app invocation

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Marcelo Vanzin
On Fri, Sep 5, 2014 at 10:50 AM, Davies Liu dav...@databricks.com wrote: In daily development, it's common to modify your projects and re-run the jobs. If using zip or egg to package your code, you need to do this every time after modification, I think it will be boring. That's why shell

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Marcelo Vanzin
Hi Davies, On Fri, Sep 5, 2014 at 1:04 PM, Davies Liu dav...@databricks.com wrote: In Douban, we use Moose FS[1] instead of HDFS as the distributed file system, it's POSIX compatible and can be mounted just as NFS. Sure, if you already have the infrastructure in place, it might be worthwhile

Re: How can I start history-server with kerberos HDFS ?

2014-09-03 Thread Marcelo Vanzin
The history server (and other Spark daemons) do not read spark-defaults.conf. There's a bug open to implement that (SPARK-2098), and an open PR to fix it, but it's still not in Spark. On Wed, Sep 3, 2014 at 11:00 AM, Zhanfeng Huo huozhanf...@gmail.com wrote: Hi, I have seted properties in

Re: If master is local, where are master and workers?

2014-09-03 Thread Marcelo Vanzin
local means everything runs in the same process; that means there is no need for master and worker daemons to start processes. On Wed, Sep 3, 2014 at 3:12 PM, Ruebenacker, Oliver A oliver.ruebenac...@altisource.com wrote: Hello, If launched with “local” as master, where are master
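
A minimal sketch of what local mode looks like from code; everything below runs inside the single driver JVM, with no separate master or worker processes:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("local-demo").setMaster("local[2]"))
    val sum = sc.parallelize(1 to 100).reduce(_ + _)   // executed by in-process threads
    sc.stop()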

Re: If master is local, where are master and workers?

2014-09-03 Thread Marcelo Vanzin
The only monitoring available is the driver's Web UI, which will generally be available on port 4040. On Wed, Sep 3, 2014 at 3:43 PM, Ruebenacker, Oliver A oliver.ruebenac...@altisource.com wrote: How can that single process be monitored? Thanks! -Original Message- From: Marcelo

Re: Hive From Spark

2014-08-21 Thread Marcelo Vanzin
Hi Du, I don't believe the Guava change has made it to the 1.1 branch. The Guava doc says hashInt was added in 12.0, so what's probably happening is that you have an old version of Guava in your classpath before the Spark jars. (Hadoop ships with Guava 11, so that may be the source of your

Re: spark-submit with HA YARN

2014-08-20 Thread Marcelo Vanzin
On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell matt.narr...@gmail.com wrote: An “unaccepted” reply to this thread from Dean Chen suggested to build Spark with a newer version of Hadoop (2.4.1) and this has worked to some extent. I’m now able to submit jobs (omitting an explicit

Re: spark-submit with HA YARN

2014-08-20 Thread Marcelo Vanzin
Ah, sorry, forgot to talk about the second issue. On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell matt.narr...@gmail.com wrote: However, now the Spark jobs running in the ApplicationMaster on a given node fails to find the active resourcemanager. Below is a log excerpt from one of the assigned

Re: spark-submit with HA YARN

2014-08-20 Thread Marcelo Vanzin
Hi, On Wed, Aug 20, 2014 at 11:59 AM, Matt Narrell matt.narr...@gmail.com wrote: Specifying the driver-class-path yields behavior like https://issues.apache.org/jira/browse/SPARK-2420 and https://issues.apache.org/jira/browse/SPARK-2848 It feels like opening a can of worms here if I also

Re: java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper

2014-08-20 Thread Marcelo Vanzin
My guess is that your test is trying to serialize a closure referencing connectionInfo; that closure will have a reference to the test instance, since the instance is needed to execute that method. Try to make the connectionInfo method local to the method where it's needed, or declare it in an
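
A hedged sketch of the suggested restructuring; the connectionInfo value and the test body are placeholders, not the original code:

    import org.apache.spark.SparkContext
    import org.scalatest.FunSuite

    class MySpec extends FunSuite {
      def connectionInfo: String = "jdbc:example://host/db"   // placeholder

      test("closure captures a value, not the test instance") {
        val sc = new SparkContext("local[2]", "test")
        val info = connectionInfo    // local copy: the closure captures this String
                                     // instead of the whole non-serializable suite
        val out = sc.parallelize(Seq("a", "b")).map(x => s"$info/$x").collect()
        sc.stop()
        assert(out.length == 2)
      }
    }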

Re: Spark memory settings on yarn

2014-08-20 Thread Marcelo Vanzin
That command line you mention in your e-mail doesn't look like something started by Spark. Spark would start one of ApplicationMaster, ExecutorRunner or CoarseGrainedExecutorBackend, not org.apache.hadoop.mapred.YarnChild. On Wed, Aug 20, 2014 at 6:56 PM, centerqi hu cente...@gmail.com wrote:

Re: spark-submit with Yarn

2014-08-19 Thread Marcelo Vanzin
On Tue, Aug 19, 2014 at 2:34 PM, Arun Ahuja aahuj...@gmail.com wrote: /opt/cloudera/parcels/CDH/bin/spark-submit \ --master yarn \ --deploy-mode client \ This should be enough. But when I view the job 4040 page, SparkUI, there is a single executor (just the driver node) and I see

Re: Reference External Variables in Map Function (Inner class)

2014-08-12 Thread Marcelo Vanzin
You could create a copy of the variable inside your Parse class; that way it would be serialized with the instance you create when calling map() below. On Tue, Aug 12, 2014 at 10:56 AM, Sunny Khatri sunny.k...@gmail.com wrote: Are there any other workarounds that could be used to pass in the
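
A hedged sketch of that idea (the class name Parse, its field, and rdd are illustrative, with rdd standing for an existing RDD[String]): give the mapper class its own copy of the external value so it travels with the serialized instance:

    // The external value lives on the driver.
    val threshold = 0.75

    // Parse keeps its own copy; serializing a Parse instance carries the value along.
    class Parse(val threshold: Double) extends java.io.Serializable {
      def apply(line: String): Boolean = line.length * 0.01 > threshold
    }

    val parser = new Parse(threshold)
    val kept = rdd.filter(line => parser(line))   // parser is serialized with each task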

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-12 Thread Marcelo Vanzin
Hi, sorry for the delay. Would you have yarn available to test? Given the discussion in SPARK-2878, this might be a different incarnation of the same underlying issue. The option in Yarn is spark.yarn.user.classpath.first On Mon, Aug 11, 2014 at 1:33 PM, DNoteboom dan...@wibidata.com wrote: I'm

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-11 Thread Marcelo Vanzin
Could you share what's the cluster manager you're using and exactly where the error shows up (driver or executor)? A quick look reveals that Standalone and Yarn use different options to control this, for example. (Maybe that already should be a bug.) On Mon, Aug 11, 2014 at 12:24 PM, DNoteboom

Re: Initial job has not accepted any resources

2014-08-07 Thread Marcelo Vanzin
There are two problems that might be happening: - You're requesting more resources than the master has available, so your executors are not starting. Given your explanation this doesn't seem to be the case. - The executors are starting, but are having problems connecting back to the driver. In

Re: [Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread Marcelo Vanzin
Can you try with -Pyarn instead of -Pyarn-alpha? I'm pretty sure CDH4 ships with the newer Yarn API. On Thu, Aug 7, 2014 at 8:11 AM, linkpatrickliu linkpatrick...@live.com wrote: Hi, Following the document: # Cloudera CDH 4.2.0 mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests

Re: [Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread Marcelo Vanzin
that ~4.2 is enough like YARN alpha, which is supported as a one-off as I understand, to work. All bets are off before YARN stable really, in my book. On Thu, Aug 7, 2014 at 6:32 PM, Marcelo Vanzin van...@cloudera.com wrote: Can you try with -Pyarn instead of -Pyarn-alpha? I'm pretty sure CDH4

Re: Create a new object by given classtag

2014-08-04 Thread Marcelo Vanzin
Hello, Try something like this: scala def newFoo[T]()(implicit ct: ClassTag[T]): T = ct.runtimeClass.newInstance().asInstanceOf[T] newFoo: [T]()(implicit ct: scala.reflect.ClassTag[T])T scala newFoo[String]() res2: String = scala newFoo[java.util.ArrayList[String]]() res5:
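
The same technique as a small compilable sketch (the types used are only examples):

    import scala.reflect.ClassTag

    def newFoo[T]()(implicit ct: ClassTag[T]): T =
      ct.runtimeClass.newInstance().asInstanceOf[T]

    val s  = newFoo[String]()                         // ""
    val xs = newFoo[java.util.ArrayList[String]]()    // empty ArrayList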

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Marcelo Vanzin
Discussions about how CDH packages Spark aside, you should be using the spark-class script (assuming you're still in 0.9) instead of executing Java directly. That will make sure that the environment needed to run Spark apps is set up correctly. CDH 5.1 ships with Spark 1.0.0, so it has

Re: Spark job tracker.

2014-07-22 Thread Marcelo Vanzin
sharath.abhis...@gmail.com wrote: Hello Marcelo Vanzin, Can you explain bit more on this? I tried using client mode but can you explain how can i use this port to write the log or output to this port?Thanks in advance! -- View this message in context: http://apache-spark-user-list.1001560.n3

Re: Spark job tracker.

2014-07-22 Thread Marcelo Vanzin
You can upload your own log4j.properties using spark-submit's --files argument. On Tue, Jul 22, 2014 at 12:45 PM, abhiguruvayya sharath.abhis...@gmail.com wrote: I fixed the error with the yarn-client mode issue which i mentioned in my earlier post. Now i want to edit the log4j.properties to

Re: Spark job tracker.

2014-07-22 Thread Marcelo Vanzin
The spark log classes are based on the actual class names. So if you want to filter out a package's logs you need to specify the full package name (e.g. org.apache.spark.storage instead of just spark.storage). On Tue, Jul 22, 2014 at 2:07 PM, abhiguruvayya sharath.abhis...@gmail.com wrote:

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Marcelo Vanzin
On Wed, Jul 16, 2014 at 12:36 PM, Matt Work Coarr mattcoarr.w...@gmail.com wrote: Thanks Marcelo, I'm not seeing anything in the logs that clearly explains what's causing this to break. One interesting point that we just discovered is that if we run the driver and the slave (worker) on the

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Marcelo Vanzin
Could you share some code (or pseudo-code)? Sounds like you're instantiating the JDBC connection in the driver, and using it inside a closure that would be run in a remote executor. That means that the connection object would need to be serializable. If that sounds like what you're doing, it
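
A hedged sketch of the usual pattern in this situation (dstream stands for an existing DStream, and the JDBC URL and SQL are placeholders): create the connection inside code that runs on the executor, for example once per partition, instead of on the driver:

    import java.sql.DriverManager

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { rows =>
        // Opened on the executor, so nothing non-serializable crosses the wire.
        val conn = DriverManager.getConnection("jdbc:example://host/db")
        try {
          val stmt = conn.prepareStatement("INSERT INTO events(value) VALUES (?)")
          rows.foreach { r => stmt.setString(1, r.toString); stmt.executeUpdate() }
        } finally {
          conn.close()
        }
      }
    }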

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Marcelo Vanzin
at 1:21 PM, Marcelo Vanzin van...@cloudera.com wrote: When I meant the executor log, I meant the log of the process launched by the worker, not the worker. In my CDH-based Spark install, those end up in /var/run/spark/work. If you look at your worker log, you'll see it's launching the executor

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-15 Thread Marcelo Vanzin
Have you looked at the slave machine to see if the process has actually launched? If it has, have you tried peeking into its log file? (That error is printed whenever the executors fail to report back to the driver. Insufficient resources to launch the executor is the most common cause of that,

Re: Spark job tracker.

2014-07-10 Thread Marcelo Vanzin
That output means you're running in yarn-cluster mode. So your code is running inside the ApplicationMaster and has no access to the local terminal. If you want to see the output: - try yarn-client mode, then your code will run inside the launcher process - check the RM web ui and look at the

Re: error when spark access hdfs with Kerberos enable

2014-07-08 Thread Marcelo Vanzin
Someone might be able to correct me if I'm wrong, but I don't believe standalone mode supports kerberos. You'd have to use Yarn for that. On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜 xuxiao...@qiyi.com wrote: Hi all, I encounter a strange issue when using spark 1.0 to access hdfs with Kerberos I

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
This is generally a side effect of your executor being killed. For example, Yarn will do that if you're going over the requested memory limits. On Tue, Jul 8, 2014 at 12:17 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: HI, I am getting this error. Can anyone help out to explain why is

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
want I can post my code here. Thanks On Wed, Jul 9, 2014 at 12:50 AM, Marcelo Vanzin van...@cloudera.com wrote: This is generally a side effect of your executor being killed. For example, Yarn will do that if you're going over the requested memory limits. On Tue, Jul 8, 2014 at 12:17 PM

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
suggest me how to increase the memory limits or how to tackle this problem. I am a novice. If you want I can post my code here. Thanks On Wed, Jul 9, 2014 at 12:50 AM, Marcelo Vanzin van...@cloudera.com wrote: This is generally a side effect of your executor being killed. For example

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Sorry, that would be sc.stop() (not close). On Tue, Jul 8, 2014 at 1:31 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Rahul, Can you try calling sc.close() at the end of your program, so Spark can clean up after itself? On Tue, Jul 8, 2014 at 12:40 PM, Rahul Bhojwani rahulbhojwani2
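
In code, the suggestion amounts to something like this (sketch; conf stands for an existing SparkConf):

    val sc = new org.apache.spark.SparkContext(conf)
    try {
      // ... job logic ...
    } finally {
      sc.stop()   // let Spark clean up its temporary files and shut down cleanly
    }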

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
: java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.<init>(Unknown Source) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:62) Can you help in that? On Wed, Jul 9, 2014 at 2:07 AM, Marcelo Vanzin van...@cloudera.com wrote: Sorry, that would be sc.stop

Re: Help with object access from mapper (simple question)

2014-06-23 Thread Marcelo Vanzin
object in Scala is similar to a class with only static fields / methods in Java. So when you set its fields in the driver, the object does not get serialized and sent to the executors; they have their own copy of the class and its static fields, which haven't been initialized. Use a proper class,
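
A hedged illustration of the difference (names are made up; rdd stands for an existing RDD[Int]):

    // Fields of a singleton object behave like Java statics: set on the driver,
    // they are NOT shipped to executors, which see their own uninitialized copy.
    object Settings { var factor = 0 }

    // A plain serializable class instance is captured by the closure and sent with the task.
    class Multiplier(val factor: Int) extends Serializable

    Settings.factor = 10
    val m = new Multiplier(10)
    rdd.map(x => x * m.factor)            // works: m travels with the closure
    // rdd.map(x => x * Settings.factor)  // on the executors, factor would still be 0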

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Marcelo Vanzin
Hi Koert, Could you provide more details? Job arguments, log messages, errors, etc. On Fri, Jun 20, 2014 at 9:40 AM, Koert Kuipers ko...@tresata.com wrote: i noticed that when i submit a job to yarn it mistakenly tries to upload files to local filesystem instead of hdfs. what could cause this?

Re: trying to understand yarn-client mode

2014-06-20 Thread Marcelo Vanzin
On Fri, Jun 20, 2014 at 8:22 AM, Koert Kuipers ko...@tresata.com wrote: thanks! i will try that. i guess what i am most confused about is why the executors are trying to retrieve the jars directly using the info i provided to add jars to my spark context. i mean, thats bound to fail no? i

Re: trying to understand yarn-client mode

2014-06-19 Thread Marcelo Vanzin
Coincidentally, I just ran into the same exception. What's probably happening is that you're specifying some jar file in your job as an absolute local path (e.g. just /home/koert/test-assembly-0.1-SNAPSHOT.jar), but your Hadoop config has the default FS set to HDFS. So your driver does not know

Re: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Marcelo Vanzin
Ah, not that it should matter, but I'm on Linux and you seem to be on Windows... maybe there is something weird going on with the Windows launcher? On Wed, Jun 11, 2014 at 10:34 AM, Marcelo Vanzin van...@cloudera.com wrote: Just tried this and it worked fine for me: ./bin/spark-shell --jars

Re: HDFS Server/Client IPC version mismatch while trying to access HDFS files using Spark-0.9.1

2014-06-11 Thread Marcelo Vanzin
The error is saying that your client libraries are older than what your server is using (2.0.0-mr1-cdh4.6.0 is IPC version 7). Try double-checking that your build is actually using that version (e.g., by looking at the hadoop jar files in lib_managed/jars). On Wed, Jun 11, 2014 at 2:07 AM, bijoy

Re: Processing audio/video/images

2014-06-02 Thread Marcelo Vanzin
Hi Jamal, If what you want is to process lots of files in parallel, the best approach is probably to load all file names into an array and parallelize that. Then each task will take a path as input and can process it however it wants. Or you could write the file list to a file, and then use
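
A hedged sketch of that approach; the directory path and the process() body are placeholders, and it assumes the paths are on storage (e.g. a shared filesystem) that every executor can read:

    import java.io.File

    val paths = new File("/data/images").listFiles().map(_.getAbsolutePath)

    def process(path: String): Long = new File(path).length()   // stand-in for real work

    val results = sc.parallelize(paths, paths.length)   // roughly one task per file
      .map(p => (p, process(p)))
      .collect()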

Re: Processing audio/video/images

2014-06-02 Thread Marcelo Vanzin
) But except dna.jpeg Lets say, I have millions of dna.jpeg and I want to run the above logic on all the millions files. How should I go about this? Thanks On Mon, Jun 2, 2014 at 5:09 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Jamal, If what you want is to process lots of files in parallel

Re: Local file being refrenced in mapper function

2014-05-30 Thread Marcelo Vanzin
Hi Rahul, I'll just copy paste your question here to aid with context, and reply afterwards. - Can I write the RDD data in excel file along with mapping in apache-spark? Is that a correct way? Isn't that a writing will be a local function and can't be passed over the clusters?? Below is

Re: Local file being refrenced in mapper function

2014-05-30 Thread Marcelo Vanzin
Hello there, On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin van...@cloudera.com wrote: workbook = xlsxwriter.Workbook('output_excel.xlsx') worksheet = workbook.add_worksheet() data = sc.textFile(xyz.txt) # xyz.txt is a file whose each line contains string delimited by SPACE row=0 def

Re: ClassCastExceptions when using Spark shell

2014-05-29 Thread Marcelo Vanzin
Hi Sebastian, That exception generally means you have the class loaded by two different class loaders, and some code is trying to mix instances created by the two different loaded classes. Do you happen to have that class both in the spark jars and in your app's uber-jar? That might explain the

Re: Invalid Class Exception

2014-05-27 Thread Marcelo Vanzin
On Tue, May 27, 2014 at 1:05 PM, Suman Somasundar suman.somasun...@oracle.com wrote: I am running this on a Solaris machine with logical partitions. All the partitions (workers) access the same Spark folder. Can you check whether you have multiple versions of the offending class

Re: unsubscribe

2014-05-19 Thread Marcelo Vanzin
Hey Andrew, Since we're seeing so many of these e-mails, I think it's worth pointing out that it's not really obvious to find unsubscription information for the lists. The community link on the Spark site (http://spark.apache.org/community.html) does not have instructions for unsubscribing; it

Re: problem with hdfs access in spark job

2014-05-16 Thread Marcelo Vanzin
Hi Marcin, On Wed, May 14, 2014 at 7:22 AM, Marcin Cylke marcin.cy...@ext.allegro.pl wrote: - This looks like some problems with HA - but I've checked namenodes during the job was running, and there was no switch between master and slave namenode. 14/05/14 15:25:44 ERROR

Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Marcelo Vanzin
the cache. Ah, yeah, sure. What I meant is that Spark itself will not, AFAIK, use that facility for adding files to the cache or anything like that. But yes, it does benefit from things already cached. On May 12, 2014, at 11:10 AM, Marcelo Vanzin van...@cloudera.com wrote: Is that true? I believe

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Marcelo Vanzin
Is that true? I believe that API Chanwit is talking about requires explicitly asking for files to be cached in HDFS. Spark automatically benefits from the kernel's page cache (i.e. if some block is in the kernel's page cache, it will be read more quickly). But the explicit HDFS cache is a

Re: Spark and Java 8

2014-05-06 Thread Marcelo Vanzin
Hi Kristoffer, You're correct that CDH5 only supports up to Java 7 at the moment. But Yarn apps do not run in the same JVM as Yarn itself (and I believe MR1 doesn't either), so it might be possible to pass arguments in a way that tells Yarn to launch the application master / executors with the

Re: Task not serializable: collect, take

2014-05-01 Thread Marcelo Vanzin
Have you tried making A extend Serializable? On Thu, May 1, 2014 at 3:47 PM, SK skrishna...@gmail.com wrote: Hi, I have the following code structure. I compiles ok, but at runtime it aborts with the error: Exception in thread main org.apache.spark.SparkException: Job aborted: Task not
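
As a minimal sketch of the suggestion (class A here is a stand-in for the original, and sc is an existing SparkContext):

    // Instances captured by a closure must be serializable to ship with the task.
    class A(val weight: Double) extends Serializable {
      def score(x: Double): Double = x * weight
    }

    val a = new A(2.0)
    val scored = sc.parallelize(Seq(1.0, 2.0, 3.0)).map(a.score).collect()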

Re: NoSuchMethodError from Spark Java

2014-04-30 Thread Marcelo Vanzin
Hi, One thing you can do is set the spark version your project depends on to 1.0.0-SNAPSHOT (make sure it matches the version of Spark you're building); then before building your project, run sbt publishLocal on the Spark tree. On Wed, Apr 30, 2014 at 12:11 AM, wxhsdp wxh...@gmail.com wrote: i
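
For illustration, a build.sbt fragment matching that suggestion; the version string must match whatever you built and published locally:

    // build.sbt -- sketch; first run `sbt publishLocal` in your Spark checkout
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"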

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-21 Thread Marcelo Vanzin
Hi Sung, On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote: The goal is to keep an intermediate value per row in memory, which would allow faster subsequent computations. I.e., computeSomething would depend on the previous value from the previous computation. I

Re: Spark is slow

2014-04-21 Thread Marcelo Vanzin
Hi Joe, On Mon, Apr 21, 2014 at 11:23 AM, Joe L selme...@yahoo.com wrote: And, I haven't gotten any answers to my questions. One thing that might explain that is that, at least for me, all (and I mean *all*) of your messages are ending up in my GMail spam folder, complaining that GMail can't

Re: Problem connecting to HDFS in Spark shell

2014-04-21 Thread Marcelo Vanzin
Hi Ken, On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken ken.willi...@windlogics.com wrote: I haven't figured out how to let the hostname default to the host mentioned in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do, but that's not so important. Try adding

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-18 Thread Marcelo Vanzin
Hi Sung, On Fri, Apr 18, 2014 at 5:11 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: while (true) { rdd.map((row : Array[Double]) = { row[numCols - 1] = computeSomething(row) }).reduce(...) } If it fails at some point, I'd imagine that the intermediate info being stored in

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Marcelo Vanzin
Hi Ian, When you run your packaged application, are you adding its jar file to the SparkContext (by calling the addJar() method)? That will distribute the code to all the worker nodes. The failure you're seeing seems to indicate the worker nodes do not have access to your code. On Mon, Apr 14,
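
A hedged sketch of what that looks like for a word count job; the jar and HDFS paths are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))
    // Ship the application jar to the workers so tasks can load its classes.
    sc.addJar("/path/to/wordcount-assembly-0.1.jar")   // path is illustrative

    val counts = sc.textFile("hdfs:///input/text")
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///output/counts")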

Re: Proper caching method

2014-04-14 Thread Marcelo Vanzin
Hi Joe, If you cache rdd1 but not rdd2, any time you need rdd2's result, it will have to be computed. It will use rdd1's cached data, but it will have to compute its result again. On Mon, Apr 14, 2014 at 5:32 AM, Joe L selme...@yahoo.com wrote: Hi I am trying to cache 2Gbyte data and to
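
A small sketch of the caching behavior described above (sc is an existing SparkContext; the input path is illustrative):

    val rdd1 = sc.textFile("hdfs:///data/input").map(_.length)
    rdd1.cache()                    // materialized once, reused afterwards

    val rdd2 = rdd1.map(_ * 2)      // NOT cached

    rdd2.count()   // computes rdd2 from rdd1's cached blocks
    rdd2.count()   // recomputes rdd2's map step again, since only rdd1 is cached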

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Marcelo Vanzin
.) Thanks, Ian On Mon, Apr 14, 2014 at 12:45 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Ian, When you run your packaged application, are you adding its jar file to the SparkContext (by calling the addJar() method)? That will distribute the code to all the worker nodes

Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-04 Thread Marcelo Vanzin
Hi Francis, This might be a long shot, but do you happen to have built spark on an encrypted home dir? (I was running into the same error when I was doing that. Rebuilding on an unencrypted disk fixed the issue. This is a known issue / limitation with ecryptfs. It's weird that the build doesn't
