Yep, and it works fine for operations that do not involve a shuffle
(like foreach, count, etc.), while those that involve shuffle operations end
up in an infinite loop. Spark should somehow indicate this instead of going
into an infinite loop.
Thanks
Best Regards
On Thu, Aug 13, 2015 at 11:37
From what I understood from Imran's mail (and what was referenced in it),
the RDD mentioned seems to be violating some basic contracts on
how partitions are used in spark [1].
They cannot be arbitrarily numbered, have duplicates, etc.
Extending RDD to add functionality is typically for niche
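The partition contract mentioned above amounts to: an RDD's partitions must carry indices 0..n-1, with no gaps and no duplicates. A minimal standalone sketch of that check (an illustration, not Spark API; the object and method names are made up here):

```scala
// Sketch of the partition contract referenced above: partition indices
// must be exactly 0..n-1, with no gaps or duplicates.
object PartitionContractCheck {
  def validIndices(indices: Seq[Int]): Boolean =
    indices.sorted.toList == (0 until indices.length).toList
}
```

An RDD whose `getPartitions` returns indices like `Seq(1, 2, 3)` or `Seq(0, 0, 1)` would fail this check, which is the kind of violation that can send shuffle operations into unexpected behavior.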
Hi, Sea
Problem solved. It turned out that I had updated the spark cluster to
1.4.1, but the client had not been updated.
Thank you so much.
Sea 261810...@qq.com wrote on Fri, Aug 14, 2015 at 1:01 PM:
I have no idea... We use Scala. You upgraded to 1.4 so quickly..., are you
using spark in
Hello,
I have written an sbt plugin called spark-deployer, which is able to
deploy a standalone Spark cluster on AWS EC2 and submit jobs to it.
https://github.com/pishen/spark-deployer
Compared to the current spark-ec2 script, this design may have several
benefits (features):
1. All the code are
Has anyone had success using this preview? We were able to build the preview
and start the spark-master; however, we were unable to connect any spark
workers to it.
Kept receiving an AkkaRpcEnv "address in use" error while attempting to connect
the spark-worker to the master. Also confirmed that the
Hi,
All I want to do is:
1. read from some source
2. do some calculation to get some byte array
3. write the byte array to HDFS
In Hadoop, I can share an ImmutableBytesWritable and do some
System.arraycopy; it will prevent the application from creating a lot of
small
I am thinking of creating a shared object outside the closure and using this
object to hold the byte array.
Will this work?
周千昊 qhz...@apache.org wrote on Fri, Aug 14, 2015 at 4:02 PM:
Hi,
All I want to do is:
1. read from some source
2. do some calculation to get some byte array
3. write
Thanks for the clarifications, Mridul.
Thanks
Best Regards
On Fri, Aug 14, 2015 at 1:04 PM, Mridul Muralidharan mri...@gmail.com
wrote:
From what I understood from Imran's mail (and what was referenced in it),
the RDD mentioned seems to be violating some basic contracts on
how partitions are
Sorry for the previous line-breaking format; resending the mail.
I have written an sbt plugin called spark-deployer, which is able to deploy
a standalone Spark cluster on AWS EC2 and submit jobs to it.
https://github.com/pishen/spark-deployer
Compared to the current spark-ec2 script, this
You can use mapPartitions to do that.
On Friday, August 14, 2015, 周千昊 qhz...@apache.org wrote:
I am thinking of creating a shared object outside the closure and using this
object to hold the byte array.
Will this work?
周千昊 qhz...@apache.org
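The mapPartitions suggestion combines naturally with the shared-buffer idea above: allocate one buffer per partition and reuse it for every record, similar to reusing an ImmutableBytesWritable in Hadoop. A hedged sketch, with a plain Iterator standing in for an RDD partition so it runs without a cluster (the record type, buffer size, and names are illustrative assumptions):

```scala
// Sketch: one reusable buffer per partition instead of a fresh byte
// array per record. A plain Iterator stands in for the partition.
object SharedBufferSketch {
  def processPartition(records: Iterator[String]): Iterator[Int] = {
    val buf = new Array[Byte](1024) // shared across all records in this partition
    records.map { rec =>
      val bytes = rec.getBytes("UTF-8")
      System.arraycopy(bytes, 0, buf, 0, bytes.length)
      bytes.length // a real job would hand (buf, length) to the HDFS writer here
    }
  }
}
```

With an RDD this would be invoked as `rdd.mapPartitions(SharedBufferSketch.processPartition)`, so the buffer is created once per partition on the executor rather than once per record.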
On Fri, Aug 14, 2015 at 4:21 AM, Josh Rosen rosenvi...@gmail.com wrote:
Prototype is at https://github.com/databricks/spark-pr-dashboard/pull/59
On Wed, Aug 12, 2015 at 7:51 PM, Josh Rosen rosenvi...@gmail.com wrote:
*TL;DR*: would anyone object if I wrote a script to auto-delete pull
A workaround would be to have multiple passes over the RDD, with each pass
writing its own output?
Or do it in a single pass with foreachPartition (opening multiple files per
partition to write out)?
-Abhishek-
On Aug 14, 2015, at 7:56 AM, Silas Davis si...@silasdavis.net wrote:
Would it be right
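The single-pass foreachPartition idea above can be sketched as follows: inside each partition, open one writer per output key, route every record to its writer, then close them all. This is a sketch under assumptions, not anyone's actual implementation; the `(key, value)` record shape and the file-naming scheme are made up for illustration:

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable

// Sketch: write records to multiple output files in one pass over a
// partition, instead of re-scanning the RDD once per output.
object MultiFileWriteSketch {
  def writePartition(records: Iterator[(String, String)], dir: File): Unit = {
    val writers = mutable.Map.empty[String, PrintWriter]
    try {
      records.foreach { case (key, value) =>
        val w = writers.getOrElseUpdate(
          key, new PrintWriter(new File(dir, s"part-$key.txt")))
        w.println(value)
      }
    } finally writers.values.foreach(_.close()) // close every opened file
  }
}
```

With an RDD this would run as `rdd.foreachPartition(it => MultiFileWriteSketch.writePartition(it, outputDir))`; each task only opens writers for the keys it actually sees.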
This is already supported with the new partitioned data sources in
DataFrame/SQL right?
On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini alex.angel...@shopify.com
wrote:
Speaking about Shopify's deployment, this would be a really nice-to-have
feature.
We would like to write data to folders
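The partitioned data sources mentioned above lay output out as Hive-style `key=value` directories. A small sketch of that path convention (the base path and column names are illustrative assumptions, and this helper is not Spark API, just the layout it produces):

```scala
// Sketch of the key=value directory layout produced by Spark's
// partitioned data sources (e.g. DataFrameWriter.partitionBy).
object PartitionPathSketch {
  def partitionPath(base: String, cols: Seq[(String, String)]): String =
    (base +: cols.map { case (k, v) => s"$k=$v" }).mkString("/")
}
```

So partitioning by `country` and `day` routes rows into folders like `/data/events/country=CA/day=2015-08-14`, which is the "write data to folders" behavior asked about above.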
What is the best way to bring such a huge file from an FTP server into Hadoop
to persist in HDFS? Since a single JVM process might run out of memory, I was
wondering if I can use Spark or Flume to do this. Any help on this matter is
appreciated.
I prefer an application/process running inside
See: https://issues.apache.org/jira/browse/SPARK-3533
Feel free to comment there and make a case if you think the issue should be
reopened.
Nick
On Fri, Aug 14, 2015 at 11:11 AM Abhishek R. Singh
abhis...@tetrationanalytics.com wrote:
A workaround would be to have multiple passes over the RDD
Why do you need to use Spark or Flume for this?
You can just use curl and hdfs:
curl ftp://blah | hdfs dfs -put - /blah
On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar
varad...@yahoo.com.invalid wrote:
What is the best way to bring such a huge file from an FTP server into
Hadoop to
Thanks for the catch. Could you send a PR with this diff?
On Fri, Aug 14, 2015 at 10:30 AM, Shkurenko, Alex ashkure...@enova.com wrote:
Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897,
but with the Decimal datatype coming from a Postgres DB:
//Set up SparkR
ref: https://issues.apache.org/jira/browse/SPARK-9370
The code to handle BigInteger types in
org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters.java
and
org.apache.spark.unsafe.Platform.java
is dependent on the implementation of java.math.BigInteger
e.g.:
try {
Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897,
but with the Decimal datatype coming from a Postgres DB:
//Set up SparkR
Sys.setenv(SPARK_HOME=/Users/ashkurenko/work/git_repos/spark)
Sys.setenv(SPARKR_SUBMIT_ARGS=--driver-class-path
Created https://issues.apache.org/jira/browse/SPARK-9982, working on the PR
On Fri, Aug 14, 2015 at 12:43 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
Thanks for the catch. Could you send a PR with this diff?
On Fri, Aug 14, 2015 at 10:30 AM, Shkurenko, Alex
I think that I'm still going to want some custom code to remove the build
start messages from SparkQA and it's hardly any code, so I'm going to stick
with the custom approach for now. The problem is that I don't want _any_
posts from AmplabJenkins, even if they're improved to be more informative,
Five months ago we reached 10,000 commits on GitHub. Today we reached 10,000
JIRA tickets.
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20created%3E%3D-1w%20ORDER%20BY%20created%20DESC
Hopefully the extra character we have to type doesn't bring our
productivity down much.
Hi devs,
Jenkins failed twice in my PR https://github.com/apache/spark/pull/8216
for an unknown error:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/40930/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/40931/console
Can you help?
Thank you!
Well, what do you do in case of failure?
I think one should use a professional ingestion tool that ideally does not
need to reload everything in case of failure and verifies that the file has
been transferred correctly via checksums.
I am not sure if Flume supports FTP, but ssh/scp should be
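The checksum verification step mentioned above can be sketched as follows. MD5 is chosen purely for illustration, and the in-memory byte array is an assumption; a real ingest pipeline would stream the file through the digest rather than buffer it:

```scala
import java.security.MessageDigest

// Sketch: verify a transferred file against a published digest.
object ChecksumSketch {
  def md5Hex(data: Array[Byte]): String =
    MessageDigest.getInstance("MD5").digest(data).map(b => f"$b%02x").mkString

  // Transfer is accepted only when the computed digest matches the
  // digest published alongside the file.
  def verify(data: Array[Byte], expectedHex: String): Boolean =
    md5Hex(data) == expectedHex
}
```

A failed `verify` would trigger a re-transfer (ideally resuming rather than reloading everything, as suggested above).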
Is it possible that you have only upgraded some set of nodes but not the
others?
We have run some performance benchmarks on this, so it definitely runs in
some configuration. It could still be buggy in some other configurations,
though.
On Fri, Aug 14, 2015 at 6:37 AM, mkhaitman
The updated prototype listed in
https://github.com/databricks/spark-pr-dashboard/pull/59 is now running
live on spark-prs as part of its PR comment update task.
On Fri, Aug 14, 2015 at 10:51 AM, Josh Rosen rosenvi...@gmail.com wrote:
I think that I'm still going to want some custom code to