Hello Everyone,
I am going through the source code of RDDs and the record readers.
There I found two classes:
1. WholeTextFileRecordReader
2. WholeCombineFileRecordReader (extends CombineFileRecordReader)
The descriptions of both classes are exactly the same.
I am not able to understand why we
Hi Stephen and Sean,
Thanks for correction.
On Sun, Nov 16, 2014 at 12:28 PM, Sean Owen so...@cloudera.com wrote:
No, the Maven build is the main one. I would use it unless you have a
need to use the SBT build in particular.
On Nov 16, 2014 2:58 AM, Dinesh J. Weerakkody
I notice that Spark serializes each task together with its dependencies (files and
JARs added to the SparkContext):
  def serializeWithDependencies(
      task: Task[_],
      currentFiles: HashMap[String, Long],
      currentJars: HashMap[String, Long],
      serializer: SerializerInstance)
    :
I thought I'd ask first since there's a good chance this isn't a
problem, but I'm seeing an issue where the first batch that Spark
Streaming processes fails (due to an app problem), but then stop()
blocks for a very long time.
This bit of JobGenerator.stop() executes, since the message appears
The current design is not ideal, but the size of dependencies should be
fairly small since we only send the path and timestamp, not the jars
themselves.
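To make the "only the path and timestamp, not the jars themselves" point concrete, here is a minimal, hypothetical sketch of that kind of wire layout: the two dependency maps (path -> timestamp) are written ahead of the serialized task bytes. The object and method names are illustrative, not Spark's actual implementation.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import scala.collection.mutable.HashMap

object TaskDeps {
  // Illustrative layout: file map, then jar map, then the task payload.
  // Only paths and timestamps go over the wire, never the jar contents,
  // which keeps the per-task overhead small.
  def serializeWithDeps(
      taskBytes: Array[Byte],
      currentFiles: HashMap[String, Long],
      currentJars: HashMap[String, Long]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    def writeMap(m: HashMap[String, Long]): Unit = {
      out.writeInt(m.size)
      for ((path, timestamp) <- m) {
        out.writeUTF(path)    // path only, not the file's bytes
        out.writeLong(timestamp)
      }
    }
    writeMap(currentFiles)
    writeMap(currentJars)
    out.writeInt(taskBytes.length)
    out.write(taskBytes)
    out.flush()
    bos.toByteArray
  }
}
```

The executor side would compare each (path, timestamp) pair against what it has already fetched and download only what changed.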
Executors can come and go. This is essentially a state replication problem,
where you have to be very careful about consistency.
On Sun, Nov 16,
I'm going to have to disagree here. If you are building a release
distribution or integrating with legacy systems then maven is probably the
correct choice. However, most of the core developers that I know use sbt,
and I think it's a better choice for exploration and development overall.
That
Yeah, my comment was mostly reflecting the fact that mvn is what
creates the releases and is the 'build of reference', from which the
SBT build is generated. The docs were recently changed to suggest that
Maven is the default build and SBT is for advanced users. I find Maven
plays nicer with IDEs,
Hi Michael,
That insight is useful. Some thoughts:
* I moved from sbt to maven in June, specifically due to Andrew Or's
describing mvn as the default build tool. Developers should keep in mind
that Jenkins uses mvn, so we need to run mvn before submitting PRs - even
if sbt were used for day to
The console mode of sbt (just run
sbt/sbt and then a long-running console session is started that will accept
further commands) is great for building individual subprojects or running
single test suites. In addition to being faster since it's a long-running
JVM, it's got a lot of nice
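For anyone new to that workflow, a console session looks roughly like this (the subproject and suite names below are illustrative):

```shell
# Start a long-running sbt session from the Spark source root
sbt/sbt

# Then, at the sbt prompt, each command reuses the warm JVM:
#   > project core                                # switch to a single subproject
#   > compile                                     # incremental compile of that project
#   > test-only org.apache.spark.rdd.RDDSuite     # run just one test suite
```

Because compilation state stays in memory between commands, the second and later builds are much faster than cold `sbt/sbt compile` invocations.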
Neither is strictly optimal which is why we ended up supporting both.
Our reference build for packaging is Maven so you are less likely to
run into unexpected dependency issues, etc. Many developers use sbt as
well. It's somewhat a matter of religion, and the best thing might be to try
both and see which you
Ok, strictly speaking, that's equivalent to your second class of
examples, development
console, not the first sbt console
On Sun, Nov 16, 2014 at 1:47 PM, Mark Hamstra m...@clearstorydata.com
wrote:
The console mode of sbt (just run
sbt/sbt and then a long running console session is started
This is more a curiosity than an immediate problem.
Here is my question: I ran into this easily solved issue
http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou
recently. The solution was to replace my class with a scala
That's a great idea, and it is also a pain point for some users. However, it
is not possible to solve this problem at compile time, because what gets
serialized can only be determined at runtime.
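To make the failure mode from the linked question concrete, here is a minimal JVM-only sketch (no Spark required; all class names are illustrative). A function object that holds a reference to a non-serializable enclosing object fails to serialize, while one that captures only the primitive it needs succeeds - the same shape as the usual "copy the field to a local val" fix:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for any user class that is not Serializable.
class Helper(val factor: Int)

// Bad: keeps a reference to the whole non-serializable Helper,
// the way a closure over `this.factor` does.
class BadMultiplier(h: Helper) extends (Int => Int) with Serializable {
  def apply(x: Int): Int = x * h.factor
}

// Good: captures only the Int it actually needs.
class GoodMultiplier(f: Int) extends (Int => Int) with Serializable {
  def apply(x: Int): Int = x * f
}

object SerCheck {
  // Returns true iff `obj` survives plain Java serialization.
  def serializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

The runtime-only nature of the problem is visible here too: whether serialization succeeds depends on which object graph the function instance happens to reference, which the compiler cannot see in general.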
There are some efforts in Scala to help users avoid mistakes like this. One
example project
Hi Jay,
I just came across SPARK-720 Statically guarantee serialization will succeed
https://issues.apache.org/jira/browse/SPARK-720 which sounds like exactly
what you're referring to. Like Reynold I think it's not possible at this
time but it would be good to get your feedback on that ticket.
I don't think the code is immediately obvious.
Davies - I think you added the code, and Josh reviewed it. Can you guys
explain and maybe submit a patch to add more documentation on the whole
thing?
Thanks.
On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad vibhanshugs...@gmail.com
wrote:
-1
I found a potential regression in 1.1.1 related to spark-submit and cluster
deploy mode: https://issues.apache.org/jira/browse/SPARK-4434
I think that this is worth fixing.
On Fri, Nov 14, 2014 at 7:28 PM, Cheng Lian lian.cs@gmail.com wrote:
+1
Tested HiveThriftServer2 against Hive
Filed as https://issues.apache.org/jira/browse/SPARK-4437
On Sun, Nov 16, 2014 at 4:49 PM, Reynold Xin r...@databricks.com wrote:
I don't think the code is immediately obvious.
Davies - I think you added the code, and Josh reviewed it. Can you guys
explain and maybe submit a patch to add
I've now finished reverting SPARK-4434 and opened a PR.
(2014/11/16 17:08), Josh Rosen wrote:
-1
I found a potential regression in 1.1.1 related to spark-submit and cluster
deploy mode: https://issues.apache.org/jira/browse/SPARK-4434
I think that this is worth fixing.
On Fri, Nov 14, 2014
Hi Dinesh, Sean, Michael, Stephen, Mark, and Patrick,
Thank you for your replies and the discussion. So the conclusion is that mvn is
preferred for packaging and distribution, while sbt is better for development.
This also explains why the build tool used by make-distribution.sh changed
from
More or less correct, but I'd add that there are an awful lot of software
systems out there that use Maven. Integrating with those systems is
generally easier if you are also working with Spark in Maven. (And I
wouldn't classify all of those Maven-built systems as legacy, Michael :)
What that
Ashutosh,
The counter will certainly be a parallelization issue when multiple nodes are
used, especially over massive datasets.
A better approach would be to use something along these lines:
  val index = sc.parallelize(Range.Long(0, rdd.count, 1),
    rdd.partitions.size)
  val rddWithIndex =
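The idea above - derive global indices from the partition structure instead of a shared mutable counter - can be sketched with plain Scala collections standing in for RDD partitions (this is essentially what RDD.zipWithIndex does; the names here are illustrative):

```scala
object PartitionIndex {
  // Assign a global index to every element without any shared counter:
  // first compute each partition's start offset from the partition sizes,
  // then number elements locally. The second step is embarrassingly
  // parallel, since each partition only needs its own offset.
  def indexAcrossPartitions[A](partitions: Seq[Seq[A]]): Seq[Seq[(A, Long)]] = {
    val offsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zip(offsets).map { case (part, start) =>
      part.zipWithIndex.map { case (a, i) => (a, start + i) }
    }
  }
}
```

On a real RDD the offsets step costs one extra pass to learn the partition sizes, which is why zipWithIndex triggers a job, but it avoids any cross-node coordination afterwards.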