JavaRDD with custom class?

2016-04-12 Thread Daniel Valdivia
Hi,

I'm moving some code from Scala to Java and just hit a wall: I'm trying to port an 
RDD with a custom data structure to Java, but I haven't been able to do so:

Scala Code:

case class TermDoc(system_id: String, category: String, terms: Seq[String])

var incTup = inc_filtered.map(record => {
  // some logic
  TermDoc(sys_id, category, termsSeq)
})

On Java I'm trying:

class TermDoc implements Serializable {
    public String system_id;
    public String category;
    public String[] terms;

    public TermDoc(String system_id, String category, String[] terms) {
        this.system_id = system_id;
        this.category = category;
        this.terms = terms;
    }
}

JavaRDD<TermDoc> incTup = inc_filtered.map(record -> {
    // some code
    return new TermDoc(sys_id, category, termsArr);
});


When I run my code, I get hit with a Task not serializable error. What am I missing 
so that I can use custom classes inside the RDD, just like in Scala?

Cheers



How to deal with same class mismatch?

2016-02-01 Thread Daniel Valdivia
Hi, I'm having a couple of issues.

I'm experiencing a known issue in the spark-shell where I'm getting a type mismatch 
even though the found and required classes are the same:

<console>:82: error: type mismatch;
 found   : org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
 required: org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]

I was wondering if anyone has found a way around this?

I was trying to dump my RDD into a brand-new RDD, copying each LabeledPoint into a 
new one, just in case an internal class was causing the problem; however, I can't 
seem to be able to access the vectors inside my LabeledPoint. In the process of 
generating my RDD I did use Java Maps and converted them back to Scala.

Any advice on how to remap this LabeledPoint?

tfidfs.take(1)
res143: Array[org.apache.spark.mllib.regression.LabeledPoint] = 
Array((143.0,(7175,[2738,4134,4756,6354,6424],[492.63076923076926,11.060794473229707,2.7010544074230283,57.69549549549549,76.2404761904762])))

tfidfs.take(1)(0).label
res144: Double = 143.0 

tfidfs.take(1)(0).features
res145: org.apache.spark.mllib.linalg.Vector = 
(7175,[2738,4134,4756,6354,6424],[492.63076923076926,11.060794473229707,2.7010544074230283,57.69549549549549,76.2404761904762])
 

tfidfs.take(1)(0).features(0)
res146: Double = 0.0 

tfidfs.take(1)(0).features(1)
res147: Double = 0.0 

tfidfs.take(1)(0).features(2)
res148: Double = 0.0
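
For what it's worth, the zeros above are expected: the features are stored as a SparseVector of size 7175, so positions 0, 1 and 2 simply aren't among its non-zero indices. A minimal sketch of the remapping idea described above, assuming the goal is just to copy every LabeledPoint into a brand-new instance (tfidfs being the RDD shown here):

import org.apache.spark.mllib.linalg.{SparseVector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Rebuild each LabeledPoint from scratch: keep the label and densify the
// feature vector so individual entries are directly addressable by position.
val remapped = tfidfs.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray))
}

// The original non-zero entries can be read off the sparse representation:
// val sv = lp.features.asInstanceOf[SparseVector]; sv.indices zip sv.values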

Set Hadoop User in Spark Shell

2016-01-14 Thread Daniel Valdivia
Hi,

I'm trying to set the value of a Hadoop parameter within spark-shell, and 
System.setProperty("HADOOP_USER_NAME", "hadoop") doesn't seem to be doing the trick.

Does anyone know how I can set the hadoop.job.ugi parameter from within spark-shell?
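
For reference, a rough sketch of two things commonly tried here; whether hadoop.job.ugi is honored depends on the Hadoop version and its security configuration, so treat this as an assumption rather than a guaranteed fix:

// Inside spark-shell: set the property on the Hadoop Configuration that
// Spark actually uses for HDFS access, rather than as a JVM system property.
sc.hadoopConfiguration.set("hadoop.job.ugi", "hadoop")

// HADOOP_USER_NAME is normally read as an environment variable at JVM
// startup, so it has to be exported before launching the shell, e.g.:
//   HADOOP_USER_NAME=hadoop spark-shell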

Cheers



Re: Put all elements of RDD into array

2016-01-11 Thread Daniel Valdivia
The ArrayBuffer did the trick!

Thanks a lot! I'm learning Scala through Spark, so these details are still new 
to me.

Sent from my iPhone

> On Jan 11, 2016, at 5:18 PM, Jakob Odersky <joder...@gmail.com> wrote:
> 
> Hey,
> I just reread your question and saw I overlooked some crucial information. 
> Here's a solution:
> 
> val data = model.asInstanceOf[DistributedLDAModel].topicDistributions.sortByKey().collect()
> 
> val tpdist = data.map(doc => doc._2.toArray)
> 
> hope it works this time
> 
>> On 11 January 2016 at 17:14, Jakob Odersky <joder...@gmail.com> wrote:
>> Hi Daniel,
>> 
>> You're actually not modifying the original array: `array :+ x ` will give 
>> you a new array with `x` appended to it.
>> In your case the fix is simple: collect() already returns an array, use it 
>> as the assignment value to your val.
>> 
>> In case you ever want to append values iteratively, search for how to use 
>> Scala's "ArrayBuffer". Also, keep in mind that RDDs have a foreach method, so 
>> there's no need to call collect followed by foreach.
>> 
>> regards,
>> --Jakob
>> 
>>  
>> 
>>> On 11 January 2016 at 16:55, Daniel Valdivia <h...@danielvaldivia.com> 
>>> wrote:
>>> Hello,
>>> 
>>> I'm trying to put all the values in a pair RDD into an array (or list) for 
>>> later storage; however, even though I'm collecting the data and then pushing 
>>> it to the array, the array size after the run is 0.
>>> 
>>> Any idea on what I'm missing? 
>>> 
>>> Thanks in advance
>>> 
>>> scala> val tpdist: Array[Array[Double]]  = Array()
>>> tpdist: Array[Array[Double]] = Array()
>>> 
>>> scala> ldaModel.asInstanceOf[DistributedLDAModel].topicDistributions.sortByKey().collect().foreach(doc => tpdist :+ doc._2.toArray)
>>> 
>>> 
>>> scala> tpdist.size
>>> res27: Int = 0
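
Putting Jakob's two points together, a minimal sketch (assuming ldaModel is the DistributedLDAModel used above):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.mllib.clustering.DistributedLDAModel

// Preferred: collect() already returns an Array, so simply assign its result.
val tpdist: Array[Array[Double]] =
  ldaModel.asInstanceOf[DistributedLDAModel]
    .topicDistributions
    .sortByKey()
    .collect()
    .map { case (_, dist) => dist.toArray }

// If values really must be appended one by one on the driver, use a mutable
// ArrayBuffer; `array :+ x` only builds a new array and throws it away.
val buf = ArrayBuffer[Array[Double]]()
tpdist.foreach(buf += _)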


Monitor Job on Yarn

2016-01-04 Thread Daniel Valdivia
Hello everyone, happy new year,

I submitted an app to YARN, but I'm unable to monitor its progress on the driver 
node, neither on :8080 nor on :4040 as documented. When submitting in standalone 
mode I could monitor it, but that doesn't seem to be the case right now.

I submitted my app this way:

spark-submit --class my.class --master yarn --deploy-mode cluster myjar.jar

So far the job seems to be on its way (the console is full of Application report 
messages), but I can't access the status of the app. Should I have submitted the 
app in a different fashion to be able to access its status?





Re: Monitor Job on Yarn

2016-01-04 Thread Daniel Valdivia
I see, I guess I should have set the historyServer.


Strangely enough, peeking into YARN it seems like nothing is "happening": it lists 
a single application running with 0% progress, but each node has 0 running 
containers, which makes me wonder whether anything is actually happening.

Should I restart the job with spark.yarn.historyServer.address set?

[hadoop@sslabnode02 ~]$ yarn node -list
16/01/04 14:43:40 INFO client.RMProxy: Connecting to ResourceManager at sslabnode01/10.15.235.239:8032
16/01/04 14:43:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Total Nodes:3
          Node-Id    Node-State    Node-Http-Address    Number-of-Running-Containers
sslabnode02:54142       RUNNING     sslabnode02:8042                               0
sslabnode01:60780       RUNNING     sslabnode01:8042                               0
sslabnode03:60569       RUNNING     sslabnode03:8042                               0

[hadoop@sslabnode02 ~]$ yarn application -list
16/01/04 14:43:55 INFO client.RMProxy: Connecting to ResourceManager at sslabnode01/10.15.235.239:8032
16/01/04 14:43:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
                Application-Id    Application-Name    Application-Type      User     Queue      State    Final-State    Progress    Tracking-URL
application_1451947397662_0001    ClusterIncidents               SPARK    hadoop   default   ACCEPTED      UNDEFINED          0%             N/A
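
For reference, a sketch of the settings involved, assuming a typical setup (the host name and log directory below are placeholders); they need to be in place in conf/spark-defaults.conf before the application is submitted, since event logging can't be enabled for a job that is already running:

# conf/spark-defaults.conf (sketch; host and directory are placeholders)
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///spark-history
spark.yarn.historyServer.address  historyhost:18080

# The history server itself reads the same directory via:
# spark.history.fs.logDirectory   hdfs:///spark-history

As for the ACCEPTED / 0% state: that usually means YARN has accepted the submission but hasn't yet allocated a container for the ApplicationMaster, which is worth investigating independently of the history server.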

> On Jan 4, 2016, at 2:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> 
> Please look at history server related content under:
> https://spark.apache.org/docs/latest/running-on-yarn.html
> 
> Note spark.yarn.historyServer.address
> FYI
> 
> On Mon, Jan 4, 2016 at 2:49 PM, Daniel Valdivia <h...@danielvaldivia.com> wrote:
> Hello everyone, happy new year,
> 
> I submitted an app to YARN, but I'm unable to monitor its progress on the driver 
> node, neither on :8080 nor on :4040 as documented. When submitting in standalone 
> mode I could monitor it, but that doesn't seem to be the case right now.
> 
> I submitted my app this way:
> 
> spark-submit --class my.class --master yarn --deploy-mode cluster myjar.jar
> 
> So far the job seems to be on its way (the console is full of Application report 
> messages), but I can't access the status of the app. Should I have submitted the 
> app in a different fashion to be able to access its status?
> 
> 
> 
> 



Re: Can't submit job to stand alone cluster

2015-12-29 Thread Daniel Valdivia
> 2. Yarn Cluster-mode: client and driver run on separate machines; 
> additionally driver runs as a thread in ApplicationMaster; use --jars option 
> with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same machine. driver is 
> NOT a thread in ApplicationMaster; use --packages to submit a jar
> 
> 
> On Tuesday, December 29, 2015 1:54 PM, Andrew Or <and...@databricks.com> 
> wrote:
> 
> 
> Hi Greg,
> 
> It's actually intentional for standalone cluster mode to not upload jars. One 
> of the reasons why YARN takes at least 10 seconds before running any simple 
> application is because there's a lot of random overhead (e.g. putting jars in 
> HDFS). If this missing functionality is not documented somewhere then we 
> should add that.
> 
> Also, the packages problem seems legitimate. Thanks for reporting it. I have 
> filed https://issues.apache.org/jira/browse/SPARK-12559.
> 
> -Andrew
> 
> 2015-12-29 4:18 GMT-08:00 Greg Hill <greg.h...@rackspace.com>:
> 
> 
> On 12/28/15, 5:16 PM, "Daniel Valdivia" <h...@danielvaldivia.com> wrote:
> 
> >Hi,
> >
> >I'm trying to submit a job to a small spark cluster running in stand
> >alone mode, however it seems like the jar file I'm submitting to the
> >cluster is "not found" by the workers nodes.
> >
> >I might have understood wrong, but I though the Driver node would send
> >this jar file to the worker nodes, or should I manually send this file to
> >each worker node before I submit the job?
> 
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
> 
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
> 
> Greg
> 
> 


Can't submit job to stand alone cluster

2015-12-28 Thread Daniel Valdivia
Hi,

I'm trying to submit a job to a small Spark cluster running in standalone mode; 
however, it seems like the jar file I'm submitting to the cluster is "not found" 
by the worker nodes.

I might have understood wrong, but I thought the driver node would send this jar 
file to the worker nodes. Or should I manually send this file to each worker node 
before I submit the job?

what I'm doing:

 $SPARK_HOME/bin/spark-submit --master spark://sslabnode01:6066 --deploy-mode 
cluster  --class ClusterIncidents 
./target/scala-2.10/cluster-incidents_2.10-1.0.jar 

The error I'm getting:

Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/28 15:13:58 INFO RestSubmissionClient: Submitting a request to launch an application in spark://sslabnode01:6066.
15/12/28 15:13:59 INFO RestSubmissionClient: Submission successfully created as driver-20151228151359-0003. Polling submission state...
15/12/28 15:13:59 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20151228151359-0003 in spark://sslabnode01:6066.
15/12/28 15:13:59 INFO RestSubmissionClient: State of driver driver-20151228151359-0003 is now ERROR.
15/12/28 15:13:59 INFO RestSubmissionClient: Driver is running on worker worker-20151218150246-10.15.235.241-52077 at 10.15.235.241:52077.
15/12/28 15:13:59 ERROR RestSubmissionClient: Exception from the cluster:
java.io.FileNotFoundException: /home/hadoop/git/scalaspark/./target/scala-2.10/cluster-incidents_2.10-1.0.jar (No such file or directory)
        java.io.FileInputStream.open(Native Method)
        java.io.FileInputStream.<init>(FileInputStream.java:146)
        org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
        org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
        org.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:202)
        org.spark-project.guava.io.Files.copy(Files.java:436)
        org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:514)
        org.apache.spark.util.Utils$.copyFile(Utils.scala:485)
        org.apache.spark.util.Utils$.doFetchFile(Utils.scala:562)
        org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
        org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
        org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
15/12/28 15:13:59 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20151228151359-0003",
  "serverSparkVersion" : "1.5.2",
  "submissionId" : "driver-20151228151359-0003",
  "success" : true
}

Thanks in advance






Re: Missing dependencies when submitting scala app

2015-12-23 Thread Daniel Valdivia
Hi Jeff,

The problem was that I was pulling json4s 3.3.0, which seems to have some problem
with Spark; I switched to 3.2.11 and everything is fine now.
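
For anyone hitting the same NoSuchMethodError, a minimal build.sbt sketch of the fix described above (the Spark and Scala versions are taken from elsewhere in this archive and are only assumptions for your project):

// build.sbt (sketch)
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
  // Pin json4s-jackson to the 3.2.x line; 3.3.0 is binary-incompatible with
  // the json4s that Spark 1.x was built against, which is what produces
  // java.lang.NoSuchMethodError: JsonMethods$.parse$default$3()Z at runtime.
  "org.json4s" %% "json4s-jackson" % "3.2.11"
)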

On Tue, Dec 22, 2015 at 5:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> It might be a jar conflict issue. Spark has a dependency on org.json4s.jackson;
> do you also specify org.json4s.jackson in your sbt dependencies, but with a
> different version?
>
> On Wed, Dec 23, 2015 at 6:15 AM, Daniel Valdivia <h...@danielvaldivia.com>
> wrote:
>
>> Hi,
>>
>> I'm trying to figure out how to bundle dependencies with a Scala
>> application. So far my code was tested successfully on the spark-shell;
>> however, now that I'm trying to run it as a stand-alone application, which
>> I'm compiling with sbt, it's yielding the error:
>>
>>
>> java.lang.NoSuchMethodError:
>> org.json4s.jackson.JsonMethods$.parse$default$3()Z at
>> ClusterIncidents$$anonfun$1.apply(ClusterInciden
>>
>> I'm doing "sbt clean package" and then spark-submit of the resulting jar,
>> however seems like either my driver or workers don't have the json4s
>> dependency, therefor can't find the parse method
>>
>> Any idea on how to solve this dependency problem?
>>
>> thanks in advance
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Missing dependencies when submitting scala app

2015-12-22 Thread Daniel Valdivia
Hi,

I'm trying to figure out how to bundle dependencies with a Scala application. So 
far my code was tested successfully on the spark-shell; however, now that I'm 
trying to run it as a stand-alone application, which I'm compiling with sbt, it's 
yielding the error:

java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
at ClusterIncidents$$anonfun$1.apply(ClusterInciden

I'm doing "sbt clean package" and then spark-submit of the resulting jar, 
however seems like either my driver or workers don't have the json4s 
dependency, therefor can't find the parse method

Any idea on how to solve this dependency problem?

thanks in advance

Scala VS Java VS Python

2015-12-16 Thread Daniel Valdivia
Hello,

This is more of a "survey" question for the community; you can reply to me 
directly so we don't flood the mailing list.

I'm having a hard time learning Spark using Python, since the API seems to be 
slightly incomplete, so I'm looking at my options for writing all my apps in 
either Scala or Java. Being a Java developer, Java 1.8 looks like the logical way 
to go; however, I'd like to ask here which is the most common (Scala or Java), 
since I'm seeing mixed results in community documentation, although Scala seems to 
be the predominant language for Spark examples.

Thanks for the advice



Re: PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-16 Thread Daniel Valdivia
Hi Abhishek,

Thanks for your suggestion. I did consider it, but I'm not sure whether, to
achieve that, I'd need to collect() the data first; I don't think it would
fit into the driver's memory.

Since I'm trying all of this inside the pyspark shell I'm using a small
dataset; however, the main dataset is about 1.5 GB of data, and my cluster
has only two nodes with 2 GB of RAM each.

Do you think that your suggestion could work without having to collect()
the results?

Thanks in advance!

On Wed, Dec 16, 2015 at 4:26 AM, Abhishek Shivkumar <
abhisheksgum...@gmail.com> wrote:

> Hello Daniel,
>
>   I was thinking if you can write
>
> catGroupArr.map(lambda line: create_and_write_file(line))
>
> def create_and_write_file(line):
>
> 1. look at the key of line: line[0]
> 2. Open a file with required file name based on key
> 3. iterate through the values of this key,value pair
>
>for ele in line[1]:
>
> 4. Write every ele into the file created.
> 5. Close the file.
>
> Do you think this works?
>
> Thanks
> Abhishek S
>
>
> Thank you!
>
> With Regards,
> Abhishek S
>
> On Wed, Dec 16, 2015 at 1:05 AM, Daniel Valdivia <h...@danielvaldivia.com>
> wrote:
>
>> Hello everyone,
>>
>> I have a PairRDD with a set of key and list of values, each value in the
>> list is a json which I already loaded beginning of my spark app, how can I
>> iterate over each value of the list in my pair RDD to transform it to a
>> string then save the whole content of the key to a file? one file per key
>>
>> my input files look like cat-0-500.txt:
>>
>> {cat:'red',value:'asd'}
>> {cat:'green',value:'zxc'}
>> {cat:'red',value:'jkl'}
>>
>> The PairRDD looks like
>>
>> ('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])
>> ('green', [{cat:'green',value:'zxc'}])
>>
>> so as you can see, I'd like to serialize each json in the value list back
>> to a string so I can easily saveAsTextFile(); of course, I'm trying to
>> save a separate file for each key
>>
>> The way I got here:
>>
>> rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
>> import json
>> categoriesJson = rawcatRdd.map(lambda x: json.loads(x))
>> categories = categoriesJson
>>
>> catByDate = categories.map(lambda x: (x['cat'], x))
>> catGroup = catByDate.groupByKey()
>> catGroupArr = catGroup.mapValues(lambda x: list(x))
>>
>> Ideally I want to create a cat-red.txt that looks like:
>>
>> {cat:'red',value:'asd'}
>> {cat:'red',value:'jkl'}
>>
>> and the same for the rest of the keys.
>>
>> I already looked at this answer
>> <http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>
>> but I'm slightly lost as to how to process each value in the list (turn it
>> into a string) before I save the contents to a file; also, I cannot figure
>> out how to import MultipleTextOutputFormat in Python either.
>>
>> I'm trying all this wacky stuff in the pyspark shell
>>
>> Any advice would be greatly appreciated
>>
>> Thanks in advance!
>>
>
>


Access row column by field name

2015-12-16 Thread Daniel Valdivia
Hi,

I'm processing the JSON I have in a text file using DataFrames; however, right now 
I'm trying to figure out a way to access a certain value within the rows of my 
DataFrame when I only know the field name and not the respective field position in 
the schema.

I noticed that row.schema and row.dtypes give me information about the 
auto-generated schema, but I cannot see a straightforward path for this. I'm 
trying to create a PairRDD out of it.

Is there any easy way to figure out the field position by its field name (the key 
it had in the JSON)?

so this

val sqlContext = new SQLContext(sc)
val rawIncRdd = sc.textFile("hdfs://1.2.3.4:8020/user/hadoop/incidents/unstructured/inc-0-500.txt")
val df = sqlContext.jsonRDD(rawIncRdd)
df.foreach(line => println(line.getString(0)))


would turn into something like this

val sqlContext = new SQLContext(sc)
val rawIncRdd = sc.textFile("hdfs://1.2.3.4:8020/user/hadoop/incidents/unstructured/inc-0-500.txt")
val df = sqlContext.jsonRDD(rawIncRdd)
df.foreach(line => println(line.getString("field_name")))

thanks for the advice
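
A minimal sketch of name-based access, assuming a Spark 1.4+ Row (which can resolve field names against the schema that jsonRDD infers); "field_name" is a placeholder for the actual JSON key:

val sqlContext = new SQLContext(sc)
val rawIncRdd = sc.textFile("hdfs://1.2.3.4:8020/user/hadoop/incidents/unstructured/inc-0-500.txt")
val df = sqlContext.jsonRDD(rawIncRdd)

// Option 1: look the position up once from the inferred schema.
val idx = df.schema.fieldIndex("field_name")
df.foreach(row => println(row.getString(idx)))

// Option 2: let each Row resolve the name directly, e.g. to build a pair RDD.
val pairs = df.map(row => (row.getAs[String]("field_name"), row))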

PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-15 Thread Daniel Valdivia
Hello everyone,

I have a PairRDD with a set of keys and a list of values; each value in the list 
is a JSON document that I already loaded at the beginning of my Spark app. How can 
I iterate over each value of the list in my pair RDD to transform it to a string, 
and then save the whole content of each key to a file, one file per key?

my input files look like cat-0-500.txt:

{cat:'red',value:'asd'}
{cat:'green',value:'zxc'}
{cat:'red',value:'jkl'}

The PairRDD looks like

('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])
('green', [{cat:'green',value:'zxc'}])

so as you can see, I'd like to serialize each json in the value list back to a 
string so I can easily saveAsTextFile(); of course, I'm trying to save a separate 
file for each key

The way I got here:

rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
import json
categoriesJson = rawcatRdd.map(lambda x: json.loads(x))
categories = categoriesJson

catByDate = categories.map(lambda x: (x['cat'], x))
catGroup = catByDate.groupByKey()
catGroupArr = catGroup.mapValues(lambda x : list(x))

Ideally I want to create a cat-red.txt that looks like:

{cat:'red',value:'asd'}
{cat:'red',value:'jkl'}

and the same for the rest of the keys.

I already looked at this answer 
(http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job) 
but I'm slightly lost as to how to process each value in the list (turn it into a 
string) before I save the contents to a file; also, I cannot figure out how to 
import MultipleTextOutputFormat in Python either.

I'm trying all this wacky stuff in the pyspark shell

Any advice would be greatly appreciated

Thanks in advance!