Needless to say, it works fine with int/string (primitive) types.
On Wed, Jan 29, 2014 at 2:04 PM, Archit Thakur archit279tha...@gmail.com wrote:
Hi,
I am facing a general problem with flatmap operation on rdd.
I am doing
MyRdd.flatMap(func(_))
MyRdd.saveAsTextFile(..)
func(Tuple2[Key,
I am trying to load xml in streaming and convert to csv and store it. When
I use textfile it separates the file on \n and hence breaks the parser.
Is it possible to receive the data one file at a time from the hdfs folder ?
Mayur Rustagi
Ph: +919632149971
Andrew, couldn't you do in the Scala code:
scala.sys.process.Process("hadoop fs -copyToLocal ...")!
or is that still considered a second step?
hadoop fs is almost certainly going to be better at copying these files
than some memory-to-disk-to-memory serdes within Spark.
--
Christopher T.
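Shelling out as Chris describes can be sketched with scala.sys.process (part of the Scala standard library). A minimal sketch, assuming the `hadoop` binary is on the PATH; the paths in the comment are placeholders:

```scala
import scala.sys.process._

// Run an external command and return its exit code (0 = success).
def runCommand(cmd: Seq[String]): Int = Process(cmd).!

// The actual copy would then look something like (placeholder paths):
//   runCommand(Seq("hadoop", "fs", "-copyToLocal", "/hdfs/src", "/local/dst"))
```

Passing the command as a Seq (rather than one interpolated string) avoids shell-quoting problems with paths that contain spaces.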
Hadn't thought of calling the hadoop process from within the scala code but
that is an improvement over my current process. Thanks for the suggestion
Chris!
It still requires saving to HDFS, dumping out to a file, and then cleaning
that temp directory out of HDFS, though, so it isn't quite my ideal
Take a look at the Mahout xmlinputformat class. That should get you
started.
On Thu, Jan 30, 2014 at 5:08 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:
I am trying to load xml in streaming and convert to csv and store it. When
I use textfile it separates the file on \n and hence breaks the
In my Spark programming thus far my unit of work has been a single row
from an hdfs file by creating an RDD[Array[String]] with something like:
spark.textFile(path).map(_.split("\t"))
Now, I'd like to do some work over a large collection of files in which
the unit of work is a single file
Hi there,
*Background:*
I need to do some matrix multiplication stuff inside the mappers, and trying
to choose between Python and Scala for writing the Spark MR jobs. I'm
equally fluent with Python and Java, and find Scala pretty easy too for what
it's worth. Going with Python would let me use
Thank you all for the feedback. As Josh suggested, the issue was due to
extending App.
On Wed, Jan 29, 2014 at 5:57 PM, Josh Rosen rosenvi...@gmail.com wrote:
Try removing the "extends App" and write a main(args: Array[String])
method instead. I think that App affects the serialization (there
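Josh's suggestion can be sketched as below; MyJob is a hypothetical name, and the Spark setup is elided since only the structure matters:

```scala
// Sketch of the fix: an explicit main method instead of `extends App`.
// With `App`, the object's body runs via delayed initialization, which
// can interfere with how Spark serializes closures that reference it.
object MyJob {
  // Pure helper so the structure is testable without a Spark cluster.
  def run(args: Array[String]): String = "running " + args.mkString(" ")

  def main(args: Array[String]): Unit = {
    // val sc = new SparkContext(...)  // Spark setup would go here
    println(run(args))
  }
}
```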
I am also interested in this. My current solution is to turn each file into a
single line of string, i.e. delete all '\n', then prepend the filename as the
head of the line, followed by a space:
[filename] [space] [content]
Anyone have better ideas ?
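The scheme above might be sketched like this; note it is lossy, since filenames containing spaces and the positions of the original newlines are not preserved:

```scala
// Collapse a file's content to one line and prepend the filename:
// "[filename] [space] [content]".
def encode(filename: String, content: String): String =
  filename + " " + content.replace("\n", " ")

// Recover the (filename, flattened-content) pair from such a line.
def decode(line: String): (String, String) = {
  val Array(name, rest) = line.split(" ", 2)
  (name, rest)
}
```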
On 2014-1-31 at 12:18 AM, Philip Ogren philip.og...@oracle.com wrote:
In my Spark
If you are in the Seattle area and interested in learning more about Spark, I
encourage you to join the Seattle Spark Meetup:
http://www.meetup.com/Seattle-Spark-Meetup/
Our first two sessions are scheduled for 3/13 (with Databricks helping us with
the kick off) and 4/9 (with the folks at
Philip, I guess the key problem statement is the "large collection of files"
part? If so this may be helpful, at the HDFS level:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/.
Otherwise you can always start with an RDD[fileUri] and go from there to an
RDD[(fileUri, read_contents)].
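The RDD[fileUri] -> RDD[(fileUri, read_contents)] step might be sketched as below; scala.io.Source only reads local paths, so a real HDFS deployment would use the Hadoop FileSystem API instead:

```scala
import scala.io.Source

// Read one whole file per record, pairing the path with its full contents.
def readWhole(path: String): (String, String) = {
  val src = Source.fromFile(path)
  try { (path, src.mkString) } finally { src.close() }
}

// In a Spark job this would be wrapped as, roughly:
//   sc.parallelize(fileUris).map(readWhole)
```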
What is the precise use case and reasoning behind wanting to work on a File as
the record in an RDD?
CombineFileInputFormat may be useful in some way:
http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
Could it be that you have the same records that you get back from flatMap,
just in a different order?
On Thu, Jan 30, 2014 at 1:05 AM, Archit Thakur archit279tha...@gmail.com wrote:
Needless to say, it works fine with int/string (primitive) types.
On Wed, Jan 29, 2014 at 2:04 PM, Archit Thakur
Actually - looking at your use case, you may simply be saving the original
RDD
Doing something like:
val newRdd = MyRdd.flatMap(func)
newRdd.saveAsTextFile(...)
May solve your issue.
On Thu, Jan 30, 2014 at 10:17 AM, Evan R. Sparks evan.spa...@gmail.com wrote:
Could it be that you have the
Yes, I do that. But if I go to my worker node and check for the list it has
printed
MyRdd.flatMap(func(_))
MyRdd.saveAsTextFile(..)
func(Tuple2[Key, Value]): List[Tuple2[MyCustomKey, MyCustomValue]] = {
  // ...
  println(list)
  list
}
The records differ (only the counts match).
On Thu, Jan 30,
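The pattern above can be mirrored on a plain Scala collection, where flatMap has the same per-element semantics; the key/value types here are placeholders for MyCustomKey/MyCustomValue:

```scala
// Each input pair expands to a list of output pairs, which flatMap
// concatenates in order.
def func(kv: (String, Int)): List[(String, Int)] =
  List((kv._1 + "-a", kv._2), (kv._1 + "-b", kv._2 * 2))

val records = List(("k1", 1), ("k2", 10))
val flattened = records.flatMap(func)
// On an RDD the same flatMap call applies per partition, so the relative
// order of results across partitions is not guaranteed.
```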
Thank you for the links! These look very useful.
I do not have a precise use case - at this point I'm just exploring what
is possible/feasible. Like the blog suggests, I might have a bunch of
images lying around and might want to collect meta-data from them. In
my case, I do a lot of NLP
There aren't any guarantees on the order that partitions are combined in
the 'saveAsTextFile' method. Generally the file will be written in
per-partition blocks, but there's no notion of order of the partitions. If
order matters to you, you can do a sortByKey at load time.
Can you provide a
I have a few questions about yarn-standalone and yarn-client deployment
modes that are described on the Launching Spark on YARN
http://spark.incubator.apache.org/docs/latest/running-on-yarn.html page.
1) Can someone give me a basic conceptual overview? I am struggling
with understanding the
Hi Jeremy,
Thanks for the reply.
Jeremy Freeman wrote
That said, there's a performance hit. In my testing (v0.8.1) a simple
algorithm, KMeans (the versions included with Spark), is ~2x faster per
iteration in Scala than Python in our set up (private HPC, ~30 nodes, each
with 128GB and 16
Hi,
We're starting to build an analytics framework for our wellness service.
While our data is not yet Big, we'd like to use a framework that will
scale as needed, and Spark seems to be the best around.
I'm new to Hadoop and Spark, and I'm having difficulty figuring out how to
use Spark in
Hi Evan,
Thanks! I didn't know that Spark has a dependency on JBLAS. That's good to
know. Does this mean I can directly use JBLAS from my Spark MR code and not
worry about the painstaking setup of getting Java to recognize the native
BLAS libraries on my system? Does Spark take care of that?
Hi Jeremy,
Can you try doing a comparison of the Scala ALS code
(https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkALS.scala)
and Python ALS code
(https://github.com/apache/incubator-spark/blob/master/python/examples/als.py)
from the
I'm trying to upgrade the CassandraTest in Spark to use CQL3.
This is the new InputFormat from Cassandra:
public class CqlPagingInputFormat extends
    AbstractColumnFamilyInputFormat<Map<String, ByteBuffer>, Map<String, ByteBuffer>>
For that, I have the following Scala/RDD code:
val casRdd =
Hola! I have about half a TB of (LZO-compressed protobuf) data that I try loading onto my cluster. I have 20 nodes and I assign 100G for executor memory.
-Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.executor.memory=100g
Now, when I load my dataset, transform it with some
Hello Folks,
I was wondering if anyone was able to successfully setup distributed
caching of jar files using CDH 5/YARN/Spark ? I can not seem to get my
cluster working in that fashion.
Regards,
Paul Schooss
I walked through the example in the second link you gave. The Treasury
Yield example referred to there is here:
https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfigV2.java
Note the InputFormat and
Hi guys,
When we're running a very long job, we would like to show users the current
progress of the map and reduce stages. After looking at the API documentation,
I don't find anything for this. However, in the Spark UI, I can see the
progress of the tasks. Is there anything I missed?
Thanks.
Sincerely,
DB
Hi All, I have tried a lot, in vain, so please help...
./run-example org.apache.spark.examples.SparkPi local
1) SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
Error 1) I don't think one should be worried about this error. This just
means that it has found two instances of the SLF4J library, one from each
JAR. And both instances are probably the same version of the SLF4J library
(both come from Spark). So this is not really an error. It's just an
annoying
That depends. By default, tasks are launched with location preference.
So if there is no free slot currently available on Node 1, Spark will wait
for a free slot. However, if you enable the delay scheduler (see the config
property spark.locality.wait), then it may launch tasks on other machines with free
You can enable fair sharing of resources between jobs in Spark. See
http://spark.incubator.apache.org/docs/latest/job-scheduling.html
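For reference, fair-sharing pools are declared in an XML allocations file; the pool name and numbers below are hypothetical, and the linked job-scheduling page documents the exact options:

```xml
<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```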
On Sun, Jan 26, 2014 at 8:25 PM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Could someone please throw some light on this?
In the lazy scheduling that Spark had