Re: Problem with flatmap.

2014-01-30 Thread Archit Thakur
Needless to say, it works fine with int/string (primitive) types. On Wed, Jan 29, 2014 at 2:04 PM, Archit Thakur archit279tha...@gmail.com wrote: Hi, I am facing a general problem with the flatMap operation on an RDD. I am doing MyRdd.flatmap(func(_)) MyRdd.saveAsTextFile(..) func(Tuple2[Key,

Streaming files as a whole

2014-01-30 Thread Mayur Rustagi
I am trying to load XML in streaming, convert it to CSV, and store it. When I use textFile it separates the file on \n and hence breaks the parser. Is it possible to receive the data one file at a time from the HDFS folder? Mayur Rustagi Ph: +919632149971 h

Re: Stream RDD to local disk

2014-01-30 Thread Christopher Nguyen
Andrew, couldn't you do in the Scala code: scala.sys.process.Process("hadoop fs -copyToLocal ...")! or is that still considered a second step? hadoop fs is almost certainly going to be better at copying these files than some memory-to-disk-to-memory serdes within Spark. -- Christopher T.
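A minimal sketch of that shell-out, assuming the hadoop binary is on the driver's PATH; both paths are hypothetical:

```scala
import scala.sys.process._

// Shell out to the hadoop CLI to pull an HDFS directory down to local disk.
val src = "hdfs:///tmp/my-job-output"   // hypothetical HDFS path
val dst = "/data/local-copy"            // hypothetical local path
val exitCode = Process(Seq("hadoop", "fs", "-copyToLocal", src, dst)).!
require(exitCode == 0, s"hadoop fs -copyToLocal failed with exit code $exitCode")
```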

Re: Stream RDD to local disk

2014-01-30 Thread Andrew Ash
Hadn't thought of calling the hadoop process from within the Scala code, but that is an improvement over my current process. Thanks for the suggestion Chris! It still requires saving to HDFS, dumping out to a file, and then cleaning that temp directory out of HDFS though, so it isn't quite my ideal

Re: Streaming files as a whole

2014-01-30 Thread Woody Christy
Take a look at the Mahout XmlInputFormat class. That should get you started. On Thu, Jan 30, 2014 at 5:08 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: I am trying to load XML in streaming, convert it to CSV, and store it. When I use textFile it separates the file on \n and hence breaks the
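A rough sketch of what that suggestion looks like via newAPIHadoopFile, shown against a plain SparkContext sc for brevity (Spark Streaming's fileStream takes the same key/value/input-format type parameters). The package path of XmlInputFormat, the xmlinput.start/xmlinput.end keys, the record tags, and the input path are assumptions that depend on the Mahout version and the data:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat  // package path differs across Mahout versions

val conf = new Configuration()
// XmlInputFormat emits one record per <record>...</record> block instead of per line.
conf.set("xmlinput.start", "<record>")   // hypothetical start tag
conf.set("xmlinput.end", "</record>")    // hypothetical end tag

val xmlRecords = sc.newAPIHadoopFile(
  "hdfs:///data/input.xml",              // hypothetical input path
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)

// Each element is now one whole XML block, ready to be parsed and rewritten as CSV.
val xmlStrings = xmlRecords.map { case (_, xml) => xml.toString }
```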

RDD[URI]

2014-01-30 Thread Philip Ogren
In my Spark programming thus far, my unit of work has been a single row from an HDFS file, by creating an RDD[Array[String]] with something like: spark.textFile(path).map(_.split("\t")) Now, I'd like to do some work over a large collection of files in which the unit of work is a single file

Python API Performance

2014-01-30 Thread nileshc
Hi there, *Background:* I need to do some matrix multiplication stuff inside the mappers, and trying to choose between Python and Scala for writing the Spark MR jobs. I'm equally fluent with Python and Java, and find Scala pretty easy too for what it's worth. Going with Python would let me use

Re: Non-interactive job fails to copy local variable to remote machines

2014-01-30 Thread Michael Diamant
Thank you all for the feedback. As Josh suggested, the issue was due to extending App. On Wed, Jan 29, 2014 at 5:57 PM, Josh Rosen rosenvi...@gmail.com wrote: Try removing the extends App and writing a main(args: Array[String]) method instead. I think that App affects the serialization (there
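For reference, the shape of that fix; the object name, master, and app name are placeholders:

```scala
import org.apache.spark.SparkContext

// Instead of `object MyJob extends App { ... }`, give the job an explicit main method,
// so the job body is not run inside App's delayed-initialization machinery.
object MyJob {                                      // hypothetical object name
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "MyJob")  // hypothetical master and app name
    // ... the job body that previously sat directly inside the App constructor ...
    sc.stop()
  }
}
```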

Re: RDD[URI]

2014-01-30 Thread 尹绪森
I am also interested in this. My current solution is to turn each file into a single line of text, i.e. delete all '\n' and prepend the filename followed by a space: [filename] [space] [content]. Does anyone have better ideas? On 2014-1-31 12:18 AM, Philip Ogren philip.og...@oracle.com wrote: In my Spark

Seattle Spark Meetup

2014-01-30 Thread Denny Lee
If you are in the Seattle area and interested in learning more about Spark, I encourage you to join the Seattle Spark Meetup: http://www.meetup.com/Seattle-Spark-Meetup/ Our first two sessions are scheduled for 3/13 (with Databricks helping us with the kick off) and 4/9 (with the folks at

Re: RDD[URI]

2014-01-30 Thread Christopher Nguyen
Philip, I guess the key problem statement is the 'large collection of files' part? If so this may be helpful, at the HDFS level: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/. Otherwise you can always start with an RDD[fileUri] and go from there to an RDD[(fileUri, read_contents)]. Sent
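A sketch of that second route, reading each file inside the tasks; the URIs are made up, sc is an existing SparkContext, and this only makes sense for files small enough to hold in memory per task:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.io.Source

val fileUris = Seq("hdfs:///docs/a.txt", "hdfs:///docs/b.txt")   // hypothetical list of file URIs
val uriAndContents = sc.parallelize(fileUris).map { uri =>
  val path = new Path(uri)
  val fs = path.getFileSystem(new Configuration())   // created per task; Hadoop Configuration is not serializable
  val in = fs.open(path)
  try {
    (uri, Source.fromInputStream(in).mkString)        // whole file as one string
  } finally {
    in.close()
  }
}
```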

Re: RDD[URI]

2014-01-30 Thread Nick Pentreath
What is the precise use case and reasoning behind wanting to work on a File as the record in an RDD? CombineFileInputFormat may be useful in some way: http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/

Re: Problem with flatmap.

2014-01-30 Thread Evan R. Sparks
Could it be that you have the same records that you get back from flatMap, just in a different order? On Thu, Jan 30, 2014 at 1:05 AM, Archit Thakur archit279tha...@gmail.com wrote: Needless to say, it works fine with int/string (primitive) types. On Wed, Jan 29, 2014 at 2:04 PM, Archit Thakur

Re: Problem with flatmap.

2014-01-30 Thread Evan R. Sparks
Actually - looking at your use case, you may simply be saving the original RDD. Doing something like: val newRdd = MyRdd.flatMap(func) newRdd.saveAsTextFile(...) may solve your issue. On Thu, Jan 30, 2014 at 10:17 AM, Evan R. Sparks evan.spa...@gmail.com wrote: Could it be that you have the
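Spelled out as a block (output path hypothetical):

```scala
val newRdd = MyRdd.flatMap(func)            // flatMap returns a *new* RDD; MyRdd itself is unchanged
newRdd.saveAsTextFile("hdfs:///out/flat")   // save the result of the flatMap, not MyRdd
```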

Re: Problem with flatmap.

2014-01-30 Thread Archit Thakur
Yes, I do that. But if I go to my worker node and check the list it has printed: MyRdd.flatmap(func(_)) MyRdd.saveAsTextFile(..) func(Tuple2[Key, Value]): List[Tuple2[MyCustomKey, MyCustomValue]] = { // *println(list)* list } the records differ (only the counts match). On Thu, Jan 30,

Re: RDD[URI]

2014-01-30 Thread Philip Ogren
Thank you for the links! These look very useful. I do not have a precise use case - at this point I'm just exploring what is possible/feasible. Like the blog suggests, I might have a bunch of images lying around and might want to collect meta-data from them. In my case, I do a lot of NLP

Re: Problem with flatmap.

2014-01-30 Thread Evan R. Sparks
There aren't any guarantees on the order in which partitions are combined by the 'saveAsTextFile' method. Generally the file will be written in per-partition blocks, but there's no notion of order among the partitions. If order matters to you, you can do a sortByKey at load time. Can you provide a
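For illustration, a sketch that imposes the order at write time rather than at load time; it assumes an implicit Ordering[MyCustomKey] is in scope so sortByKey applies:

```scala
// Partitions are written as separate blocks with no guaranteed order, so sort explicitly before saving.
val sorted = MyRdd.flatMap(func).sortByKey()   // requires an Ordering[MyCustomKey]
sorted.saveAsTextFile("hdfs:///out/sorted")    // hypothetical output path
```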

various questions about yarn-standalone vs. yarn-client

2014-01-30 Thread Philip Ogren
I have a few questions about yarn-standalone and yarn-client deployment modes that are described on the Launching Spark on YARN http://spark.incubator.apache.org/docs/latest/running-on-yarn.html page. 1) Can someone give me a basic conceptual overview? I am struggling with understanding the

Re: Python API Performance

2014-01-30 Thread nileshc
Hi Jeremy, Thanks for the reply. Jeremy Freeman wrote: That said, there's a performance hit. In my testing (v0.8.1) a simple algorithm, KMeans (the versions included with Spark), is ~2x faster per iteration in Scala than Python in our setup (private HPC, ~30 nodes, each with 128GB and 16

Spark + MongoDB

2014-01-30 Thread Sampo Niskanen
Hi, We're starting to build an analytics framework for our wellness service. While our data is not yet Big, we'd like to use a framework that will scale as needed, and Spark seems to be the best around. I'm new to Hadoop and Spark, and I'm having difficulty figuring out how to use Spark in

Re: Python API Performance

2014-01-30 Thread nileshc
Hi Evan, Thanks! I didn't know that Spark has a dependency on JBLAS. That's good to know. Does this mean I can directly use JBLAS from my Spark MR code and not worry about the painstaking setup of getting Java to recognize the native BLAS libraries on my system? Does Spark take care of that?
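Assuming jblas is on the executor classpath (in this era it comes in as an MLlib dependency), a mapper can call it directly; a small sketch with made-up matrices, where whether it binds to a native BLAS or the Java fallback still depends on the libraries installed on each worker node, and sc is an existing SparkContext:

```scala
import org.jblas.DoubleMatrix

// Multiply each row vector in an RDD by a small shared matrix using jblas.
val weights = new DoubleMatrix(Array(Array(1.0, 2.0), Array(3.0, 4.0)))   // hypothetical 2x2 matrix
val rows = sc.parallelize(Seq(Array(1.0, 0.0), Array(0.0, 1.0)))          // hypothetical row vectors
val products = rows.map { r =>
  weights.mmul(new DoubleMatrix(r)).toArray                               // (2x2) * (2x1) column vector
}
products.collect().foreach(p => println(p.mkString(",")))
```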

Re: Python API Performance

2014-01-30 Thread nileshc
Hi Jeremy, Can you try doing a comparison of the Scala ALS code (https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkALS.scala) and Python ALS code (https://github.com/apache/incubator-spark/blob/master/python/examples/als.py) from the

CQL3 Example (Scala Noobie Question)

2014-01-30 Thread Brian O'Neill
I'm trying to upgrade the CassandraTest in Spark to use CQL3. This is the new InputFormat from Cassandra: public class CqlPagingInputFormat extends AbstractColumnFamilyInputFormat<Map<String, ByteBuffer>, Map<String, ByteBuffer>> For that, I have the following Scala/RDD code: val casRdd =
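For the Scala-generics part of the question, a sketch of wiring that input format into newAPIHadoopRDD; the ConfigHelper calls, host/port, partitioner, and keyspace/table are assumptions tied to the Cassandra 1.2-era Hadoop support, and sc is an existing SparkContext:

```scala
import java.nio.ByteBuffer
import java.util.{Map => JMap}
import org.apache.cassandra.hadoop.ConfigHelper
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat
import org.apache.hadoop.mapreduce.Job

val job = new Job()
ConfigHelper.setInputInitialAddress(job.getConfiguration, "127.0.0.1")      // hypothetical Cassandra host
ConfigHelper.setInputRpcPort(job.getConfiguration, "9160")
ConfigHelper.setInputColumnFamily(job.getConfiguration, "casDemo", "Words") // hypothetical keyspace/table
ConfigHelper.setInputPartitioner(job.getConfiguration, "Murmur3Partitioner")

// Scala needs the Java generics spelled out as java.util.Map[String, ByteBuffer] for both key and value.
val casRdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[CqlPagingInputFormat],
  classOf[JMap[String, ByteBuffer]],
  classOf[JMap[String, ByteBuffer]])
```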

Kryo serialization slow and runs OOM

2014-01-30 Thread Vipul Pandey
Hola! I have about half a TB of (LZO-compressed protobuf) data that I try loading onto my cluster. I have 20 nodes and I assign 100G for executor memory. -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.executor.memory=100g Now, when I load my dataset, transform it with some

Configuring distributed caching with Spark and YARN

2014-01-30 Thread Paul Schooss
Hello Folks, I was wondering if anyone was able to successfully set up distributed caching of jar files using CDH 5/YARN/Spark? I cannot seem to get my cluster working in that fashion. Regards, Paul Schooss

Re: Spark + MongoDB

2014-01-30 Thread Tathagata Das
I walked through the example in the second link you gave. The Treasury Yield example referred to there is here: https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfigV2.java. Note the InputFormat and
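The Java example linked above can be condensed into a Scala sketch along these lines; the connector class names, the mongo.input.uri key, the database/collection, and the field name are assumptions tied to the mongo-hadoop version, and sc is an existing SparkContext:

```scala
import com.mongodb.hadoop.MongoInputFormat
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject

val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/wellness.events")   // hypothetical db.collection

// Each record comes back as (document id, BSON document).
val docs = sc.newAPIHadoopRDD(
  mongoConf,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])

val userIds = docs.map { case (_, doc) => doc.get("userId") }   // hypothetical field
```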

Is there a way to get the current progress of the job?

2014-01-30 Thread DB Tsai
Hi guys, When we're running a very long job, we would like to show users the current progress of the map and reduce jobs. After looking at the API documentation, I don't find anything for this. However, in the Spark UI, I can see the progress of the tasks. Is there anything I missed? Thanks. Sincerely, DB
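One rough workaround is to hook the same event stream the web UI is built on; a sketch with a hypothetical listener, noting that the SparkListener API has changed shape across Spark versions, and sc is an existing SparkContext:

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Count finished tasks on the driver as they complete.
class TaskProgressListener extends SparkListener {
  private val finished = new AtomicInteger(0)
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    println(s"tasks finished so far: ${finished.incrementAndGet()}")
  }
}

sc.addSparkListener(new TaskProgressListener)
```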

SLF4J error and log4j error

2014-01-30 Thread Sai Prasanna
Hi All, I have tried a lot, in vain, so please help... ./run-example org.apache.spark.examples.SparkPi local *1) SLF4J: Class path contains multiple SLF4J bindings.* SLF4J: Found binding in

Re: SLF4J error and log4j error

2014-01-30 Thread Tathagata Das
Error 1) I don't think one should be worried about this error. It just means that it has found two instances of the SLF4J library, one from each JAR. And both instances are probably the same version of the SLF4J library (both come from Spark). So this is not really an error. It's just an annoying

Re: Distributed Shared Access of Cached RDD's

2014-01-30 Thread Tathagata Das
That depends. By default, the tasks are launched with location preference. So if there is no free slot currently available on Node 1, Spark will wait for a free slot. However, if you enable delay scheduling (see the config property spark.locality.wait), then it may launch tasks on other machines with free
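For reference, a sketch of tuning that property with the 0.9-style SparkConf; the master URL and wait value are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Lower the locality wait so tasks fall back to non-preferred nodes sooner
// instead of waiting for a slot on the node holding the cached partition.
val conf = new SparkConf()
  .setMaster("spark://master:7077")        // hypothetical master URL
  .setAppName("LocalityTuningExample")
  .set("spark.locality.wait", "500")       // milliseconds; default is 3000
val sc = new SparkContext(conf)
```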

Re: Lazy Scheduling

2014-01-30 Thread Tathagata Das
You can enable fair sharing of resources between jobs in Spark. See http://spark.incubator.apache.org/docs/latest/job-scheduling.html On Sun, Jan 26, 2014 at 8:25 PM, Sai Prasanna ansaiprasa...@gmail.com wrote: Please someone throw some light on this. In the lazy scheduling that Spark had
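A minimal sketch of turning on fair sharing inside one application, following the linked docs; the master URL and pool name are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Use the FAIR scheduler so jobs submitted concurrently (e.g. from different threads)
// share executors instead of running strictly FIFO.
val conf = new SparkConf()
  .setMaster("local[4]")                      // hypothetical master
  .setAppName("FairSchedulingExample")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Optionally tag work from this thread with a named pool defined in a fairscheduler.xml file.
sc.setLocalProperty("spark.scheduler.pool", "interactive")   // hypothetical pool name
```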