Re: Spark runs into an infinite loop even if the tasks are completed successfully

2015-08-14 Thread Akhil Das
Yep, and it works fine for operations that do not involve any shuffle (like foreach, count, etc.), while those that involve shuffle operations end up in an infinite loop. Spark should somehow indicate this instead of going into an infinite loop. Thanks Best Regards On Thu, Aug 13, 2015 at 11:37

Re: Spark runs into an infinite loop even if the tasks are completed successfully

2015-08-14 Thread Mridul Muralidharan
From what I understood of Imran's mail (and what was referenced in it), the RDD mentioned seems to be violating some basic contracts on how partitions are used in Spark [1]. They cannot be arbitrarily numbered, have duplicates, etc. Extending RDD to add functionality is typically for niche
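For context, the partition contract being referenced looks roughly like the sketch below: a custom RDD's getPartitions must return one Partition per chunk of data, indexed 0 until n in order, with no gaps or duplicates. The class and data here are hypothetical; only the indexing rule comes from the RDD API.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical custom RDD that honors the partition contract:
// exactly one Partition per chunk, with index == position in the array.
class ChunkRDD(sc: SparkContext, chunks: Seq[Seq[Int]]) extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] =
    chunks.indices.map { i =>
      new Partition { override val index: Int = i } // 0, 1, ..., n-1, no duplicates
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    chunks(split.index).iterator
}
```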

Re: please help with ClassNotFoundException

2015-08-14 Thread 周千昊
Hi Sea, Problem solved. It turned out that I had updated the spark cluster to 1.4.1 but the client had not been updated. Thank you so much. Sea 261810...@qq.com wrote on Fri, Aug 14, 2015 at 1:01 PM: I have no idea... We use scala. You upgraded to 1.4 so quickly..., are you using spark in

Introduce an sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-14 Thread pishen tsai
Hello, I have written an sbt plugin called spark-deployer, which is able to deploy a standalone spark cluster on AWS EC2 and submit jobs to it. https://github.com/pishen/spark-deployer Compared to the current spark-ec2 script, this design may have several benefits (features): 1. All the code is

Re: Fwd: [ANNOUNCE] Spark 1.5.0-preview package

2015-08-14 Thread mkhaitman
Has anyone had success using this preview? We were able to build the preview and start the spark-master; however, we were unable to connect any spark workers to it. We kept receiving an AkkaRpcEnv "address in use" error while attempting to connect the spark-worker to the master. Also confirmed that the

avoid creating small objects

2015-08-14 Thread 周千昊
Hi all, What I want to do is: 1. read from some source; 2. do some calculation to get a byte array; 3. write the byte array to HDFS. In Hadoop, I can share an ImmutableByteWritable and do some System.arraycopy, which will prevent the application from creating a lot of small

Re: avoid creating small objects

2015-08-14 Thread 周千昊
I am thinking of creating a shared object outside the closure and using this object to hold the byte array. Will this work? 周千昊 qhz...@apache.org wrote on Fri, Aug 14, 2015 at 4:02 PM: Hi all, What I want to do is: 1. read from some source; 2. do some calculation to get a byte array; 3. write

Re: Spark runs into an infinite loop even if the tasks are completed successfully

2015-08-14 Thread Akhil Das
Thanks for the clarifications, Mridul. Thanks Best Regards On Fri, Aug 14, 2015 at 1:04 PM, Mridul Muralidharan mri...@gmail.com wrote: From what I understood of Imran's mail (and what was referenced in it), the RDD mentioned seems to be violating some basic contracts on how partitions are

Re: Introduce an sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-14 Thread pishen tsai
Sorry for the previous line-breaking format; resending the mail. I have written an sbt plugin called spark-deployer, which is able to deploy a standalone spark cluster on AWS EC2 and submit jobs to it. https://github.com/pishen/spark-deployer Compared to the current spark-ec2 script, this

Re: avoid creating small objects

2015-08-14 Thread Reynold Xin
You can use mapPartitions to do that. On Friday, August 14, 2015, 周千昊 qhz...@apache.org wrote: I am thinking of creating a shared object outside the closure and using this object to hold the byte array. Will this work? 周千昊 qhz...@apache.org
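A minimal sketch of the per-partition buffer reuse being suggested here, written with foreachPartition since the thread ends by writing to HDFS; the paths, buffer size, and the toy encodeInto helper are all made up for illustration, assuming an RDD[String] named rdd. The reuse is only safe because each record is fully written out before the buffer is overwritten; if records were buffered or shuffled downstream, they would have to be copied first.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Toy encoder, illustration only: writes `record` into `buf` as ASCII
// bytes and returns the number of bytes written.
def encodeInto(record: String, buf: Array[Byte]): Int = {
  val n = math.min(record.length, buf.length)
  var i = 0
  while (i < n) { buf(i) = record.charAt(i).toByte; i += 1 }
  n
}

rdd.foreachPartition { iter =>
  val fs     = FileSystem.get(new Configuration())
  val out    = fs.create(new Path(s"/tmp/out-${java.util.UUID.randomUUID}"))
  val buffer = new Array[Byte](64 * 1024) // one buffer, reused for every record
  try {
    iter.foreach { record =>
      val n = encodeInto(record, buffer)  // overwrite the shared buffer
      out.write(buffer, 0, n)             // fully consumed before the next reuse
    }
  } finally out.close()
}
```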

Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-14 Thread Iulian Dragoș
On Fri, Aug 14, 2015 at 4:21 AM, Josh Rosen rosenvi...@gmail.com wrote: Prototype is at https://github.com/databricks/spark-pr-dashboard/pull/59 On Wed, Aug 12, 2015 at 7:51 PM, Josh Rosen rosenvi...@gmail.com wrote: *TL;DR*: would anyone object if I wrote a script to auto-delete pull

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Abhishek R. Singh
A workaround would be to have multiple passes over the RDD, with each pass writing its own output. Or do it in a single pass with foreachPartition (opening multiple files per partition to write out)? -Abhishek- On Aug 14, 2015, at 7:56 AM, Silas Davis si...@silasdavis.net wrote: Would it be right
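A rough sketch of the single-pass foreachPartition workaround, assuming the records are (label, line) string pairs in an RDD named rdd; the output paths and record shape are assumptions. Each partition lazily opens one writer per output key and routes records to it.

```scala
import java.io.{BufferedWriter, OutputStreamWriter}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable

rdd.foreachPartition { iter =>
  val fs      = FileSystem.get(new Configuration())
  val writers = mutable.Map.empty[String, BufferedWriter]
  try {
    iter.foreach { case (label, line) =>
      // Open the writer for this output on first use, then reuse it.
      val w = writers.getOrElseUpdate(label, {
        val path = new Path(s"/tmp/multi/$label/part-${java.util.UUID.randomUUID}")
        new BufferedWriter(new OutputStreamWriter(fs.create(path)))
      })
      w.write(line)
      w.newLine()
    }
  } finally writers.values.foreach(_.close())
}
```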

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Reynold Xin
This is already supported with the new partitioned data sources in DataFrame/SQL, right? On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini alex.angel...@shopify.com wrote: Speaking about Shopify's deployment, this would be a really nice feature to have. We would like to write data to folders
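Roughly what this refers to: since Spark 1.4, DataFrameWriter.partitionBy writes one directory per distinct partition-column value. The column names and path below are made up for illustration.

```scala
// Writes one directory per (country, date) combination, e.g.
// /data/events/country=CA/date=2015-08-14/part-*.parquet
df.write
  .partitionBy("country", "date")
  .parquet("/data/events")
```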

Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Varadhan, Jawahar
What is the best way to bring such a huge file from an FTP server into Hadoop to persist in HDFS? Since a single JVM process might run out of memory, I was wondering if I could use Spark or Flume to do this. Any help on this matter is appreciated. I would prefer an application/process running inside

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Nicholas Chammas
See: https://issues.apache.org/jira/browse/SPARK-3533 Feel free to comment there and make a case if you think the issue should be reopened. Nick On Fri, Aug 14, 2015 at 11:11 AM Abhishek R. Singh abhis...@tetrationanalytics.com wrote: A workaround would be to have multiple passes on the RDD

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Marcelo Vanzin
Why do you need to use Spark or Flume for this? You can just use curl and hdfs: curl ftp://blah | hdfs dfs -put - /blah On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar varad...@yahoo.com.invalid wrote: What is the best way to bring such a huge file from a FTP server into Hadoop to

Re: SparkR DataFrame fails to return data of Decimal type

2015-08-14 Thread Shivaram Venkataraman
Thanks for the catch. Could you send a PR with this diff? On Fri, Aug 14, 2015 at 10:30 AM, Shkurenko, Alex ashkure...@enova.com wrote: Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but with the Decimal datatype coming from a Postgres DB: //Set up SparkR

Reliance on java.math.BigInteger implementation

2015-08-14 Thread Pete Robbins
ref: https://issues.apache.org/jira/browse/SPARK-9370 The code to handle BigInteger types in org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters.java and org.apache.spark.unsafe.Platform.java is dependent on the implementation of java.math.BigInteger, e.g.: try {
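A simplified illustration of the kind of implementation dependency at issue (not the actual Spark code): reading BigInteger's private internals instead of going through its public API. The field names "signum" and "mag" are OpenJDK internals, which is exactly what makes the approach fragile; on a JVM whose BigInteger is laid out differently, the lookups below throw NoSuchFieldException.

```scala
import java.math.BigInteger

val bi = new BigInteger("12345678901234567890")

// OpenJDK-specific internals; other JVMs may not have these fields.
val signumField = classOf[BigInteger].getDeclaredField("signum")
val magField    = classOf[BigInteger].getDeclaredField("mag")
signumField.setAccessible(true)
magField.setAccessible(true)

val signum = signumField.getInt(bi)                     // sign of the value
val mag    = magField.get(bi).asInstanceOf[Array[Int]]  // magnitude words
println(s"signum=$signum, mag=${mag.mkString("[", ",", "]")}")
```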

SparkR DataFrame fails to return data of Decimal type

2015-08-14 Thread Shkurenko, Alex
Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but with the Decimal datatype coming from a Postgres DB: //Set up SparkR Sys.setenv(SPARK_HOME=/Users/ashkurenko/work/git_repos/spark) Sys.setenv(SPARKR_SUBMIT_ARGS=--driver-class-path

Re: SparkR DataFrame fails to return data of Decimal type

2015-08-14 Thread Shkurenko, Alex
Created https://issues.apache.org/jira/browse/SPARK-9982, working on the PR On Fri, Aug 14, 2015 at 12:43 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Thanks for the catch. Could you send a PR with this diff ? On Fri, Aug 14, 2015 at 10:30 AM, Shkurenko, Alex

Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-14 Thread Josh Rosen
I think that I'm still going to want some custom code to remove the build start messages from SparkQA, and it's hardly any code, so I'm going to stick with the custom approach for now. The problem is that I don't want _any_ posts from AmplabJenkins, even if they're improved to be more informative,

SPARK-10000 + now

2015-08-14 Thread Reynold Xin
Five months ago we reached 10,000 commits on GitHub. Today we reached 10,000 JIRA tickets. https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20created%3E%3D-1w%20ORDER%20BY%20created%20DESC Hopefully the extra character we have to type doesn't bring our productivity down much.

Jenkins having issues?

2015-08-14 Thread Cheolsoo Park
Hi devs, Jenkins failed twice in my PR https://github.com/apache/spark/pull/8216 with an unknown error: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/40930/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/40931/console Can you help? Thank you!

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Jörn Franke
Well, what do you do in case of failure? I think one should use a professional ingestion tool that ideally does not need to reload everything in case of failure and that verifies the file has been transferred correctly via checksums. I am not sure if Flume supports FTP, but ssh/scp should be

Re: Fwd: [ANNOUNCE] Spark 1.5.0-preview package

2015-08-14 Thread Reynold Xin
Is it possible that you have only upgraded some of the nodes but not the others? We have run some performance benchmarks on this, so it definitely runs in some configuration. It could still be buggy in other configurations, though. On Fri, Aug 14, 2015 at 6:37 AM, mkhaitman

Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-14 Thread Josh Rosen
The updated prototype listed in https://github.com/databricks/spark-pr-dashboard/pull/59 is now running live on spark-prs as part of its PR comment update task. On Fri, Aug 14, 2015 at 10:51 AM, Josh Rosen rosenvi...@gmail.com wrote: I think that I'm still going to want some custom code to