driver memory management

2014-09-28 Thread Brad Miller
Hi All, I am interested to collect() a large RDD so that I can run a learning algorithm on it. I've noticed that when I don't increase SPARK_DRIVER_MEMORY I can run out of memory. I've also noticed that it looks like the same fraction of memory is reserved for storage on the driver as on the work
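
A minimal sketch of the configuration this thread is about, assuming the standard property names (spark.driver.memory, spark.storage.memoryFraction); the values are placeholders, not figures from the message:

    # Sketch of the knobs discussed in this thread; values are placeholders.
    # spark.driver.memory generally has to be supplied before the driver JVM
    # starts (spark-defaults.conf, --driver-memory, or SPARK_DRIVER_MEMORY).
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("collect-large-rdd")
            # 0.6 of the heap is reserved for the block store by default; lowering
            # it leaves more room on the driver for the collect()ed result
            .set("spark.storage.memoryFraction", "0.2"))
    sc = SparkContext(conf=conf)

    result = sc.parallelize(range(1000)).collect()  # stand-in for the real RDD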

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-26 Thread Brad Miller
ads(self, obj): > > OverflowError: size does not fit in an int ***BLOCK 10** [ERROR 1]* check_pre_serialized(30) ...same as above... ***BLOCK 11** [ERROR 3]* check_unserialized(30) ...same as above... On Thu, Sep 25, 2014 at 2:55 PM, Davies Liu wrote: > > On Thu, Sep 25, 2014 at 11:2

Re: java.io.IOException Error in task deserialization

2014-09-26 Thread Brad Miller
> > Arun > > On Fri, Sep 26, 2014 at 1:32 PM, Brad Miller > wrote: > >> I've had multiple jobs crash due to "java.io.IOException: unexpected >> exception type"; I've been running the 1.1 branch for some time and am now >> running the 1.

Re: java.io.IOException Error in task deserialization

2014-09-26 Thread Brad Miller
I've had multiple jobs crash due to "java.io.IOException: unexpected exception type"; I've been running the 1.1 branch for some time and am now running the 1.1 release binaries. Note that I only use PySpark. I haven't kept detailed notes or the tracebacks around since there are other problems that

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-25 Thread Brad Miller
k said that the serialized closure cannot be parsed (base64) > >> correctly by py4j. > >> > >> The string in Java cannot be longer than 2G, so the serialized closure > >> cannot longer than 1.5G (there are overhead in base64), is it possible > >> that your dat
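
The roughly 1.5 G figure quoted above follows from base64's 4-characters-per-3-bytes expansion against the ~2**31-character Java String limit; a quick check (an illustration, not part of the original message):

    # base64 turns every 3 bytes of binary data into 4 ASCII characters, and a
    # Java String holds at most about 2**31 characters.
    max_chars = 2 ** 31
    max_raw_bytes = max_chars * 3 // 4
    print(max_raw_bytes / float(2 ** 30))  # ~1.5, i.e. ~1.5 GiB of serialized closure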

java.lang.NegativeArraySizeException in pyspark

2014-09-20 Thread Brad Miller
Hi All, I'm experiencing a java.lang.NegativeArraySizeException in a pyspark script I have. I've pasted the full traceback at the end of this email. I have isolated the line of code in my script which "causes" the exception to occur. Although the exception seems to occur deterministically, it is

java.util.NoSuchElementException: key not found

2014-09-16 Thread Brad Miller
Hi All, I suspect I am experiencing a bug. I've noticed that while running larger jobs, they occasionally die with the exception "java.util.NoSuchElementException: key not found xyz", where "xyz" denotes the ID of some particular task. I've excerpted the log from one job that died in this way bel

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-14 Thread Brad Miller
Hi Andrew, I agree with Nicholas. That was a nice, concise summary of the meaning of the locality customization options, indicators and default Spark behaviors. I haven't combed through the documentation end-to-end in a while, but I'm also not sure that information is presently represented somew

Re: coalesce on SchemaRDD in pyspark

2014-09-12 Thread Brad Miller
the next bugfix release, you can work around this by: > > srdd = sqlCtx.jsonRDD(rdd) > srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx) > > > On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller > wrote: > > Hi All, > > > > I'm having some trouble
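
A cleaned-up version of the quoted workaround (a sketch only: the email quotes _schema_rdd, but the private attribute name and the SchemaRDD constructor varied across 1.x releases, and the boolean must be the Python literal False when called from PySpark):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, SchemaRDD

    sc = SparkContext("local[2]", "coalesce-workaround")
    sqlCtx = SQLContext(sc)

    srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}']))
    # call coalesce on the underlying Java SchemaRDD and re-wrap it; check
    # whether your version stores it as _jschema_rdd or _schema_rdd
    srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(1, False, None), sqlCtx)
    print(srdd2.count())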

coalesce on SchemaRDD in pyspark

2014-09-11 Thread Brad Miller
Hi All, I'm having some trouble with the coalesce and repartition functions for SchemaRDD objects in pyspark. When I run: sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])).coalesce(1) I get this error: Py4JError: An error occurred while calling o94.coalesce. Trace: py4j.Py4JEx

Re: TimeStamp selection with SparkSQL

2014-09-05 Thread Brad Miller
back a table in a SQLContext. On Fri, Sep 5, 2014 at 9:53 AM, Benjamin Zaitlen wrote: > Hi Brad, > > When you do the conversion is this a Hive/Spark job or is it a > pre-processing step before loading into HDFS? > > ---Ben > > > On Fri, Sep 5, 2014 at 10:29 AM, Br

Re: TimeStamp selection with SparkSQL

2014-09-05 Thread Brad Miller
My approach may be partly influenced by my limited experience with SQL and Hive, but I just converted all my dates to seconds-since-epoch and then selected samples from specific time ranges using integer comparisons. On Thu, Sep 4, 2014 at 6:38 PM, Cheng, Hao wrote: > There are 2 SQL dialects,
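
A small illustration of that conversion in plain Python (the field names and format string are made up for the example):

    import calendar
    import time

    def to_epoch_seconds(ts, fmt="%Y-%m-%d %H:%M:%S"):
        """Convert a UTC timestamp string to integer seconds since the epoch."""
        return calendar.timegm(time.strptime(ts, fmt))

    record = {"event": "login", "ts": to_epoch_seconds("2014-09-05 09:53:00")}
    # range selection then becomes a plain integer comparison, e.g. in SQL:
    #   SELECT * FROM events WHERE ts >= 1409875200 AND ts < 1409961600
    print(record)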

/tmp/spark-events permissions problem

2014-08-29 Thread Brad Miller
Hi All, Yesterday I restarted my cluster, which had the effect of clearing /tmp. When I brought Spark back up and ran my first job, /tmp/spark-events was re-created and the job ran fine. I later learned that other users were receiving errors when trying to create a spark context. It turned out

Re: Spark webUI - application details page

2014-08-29 Thread Brad Miller
How did you specify the HDFS path? When I put spark.eventLog.dir hdfs://crosby.research.intel-research.net:54310/tmp/spark-events in my spark-defaults.conf file, I receive the following error: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.io.IOEx
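
For reference, a sketch of enabling event logging from PySpark itself; the property names are the standard ones, and the HDFS URL below is a placeholder rather than the one from this thread:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("history-enabled-job")
            .set("spark.eventLog.enabled", "true")
            # the directory must already exist and be writable by the submitting user
            .set("spark.eventLog.dir", "hdfs://namenode:54310/tmp/spark-events"))
    sc = SparkContext(conf=conf)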

Re: Spark webUI - application details page

2014-08-28 Thread Brad Miller
Hi All, @Andrew Thanks for the tips. I just built the master branch of Spark last night, but am still having problems viewing history through the standalone UI. I dug into the Spark job events directories as you suggested, and I see at a minimum 'SPARK_VERSION_1.0.0' and 'EVENT_LOG_1'; for appli

Re: Spark webUI - application details page

2014-08-15 Thread Brad Miller
Hi Andrew, I'm running something close to the present master (I compiled several days ago) but am having some trouble viewing history. I set "spark.eventLog.dir" to true, but continually receive the error message (via the web UI) "Application history not found...No event logs found for applicatio

SPARK_LOCAL_DIRS

2014-08-14 Thread Brad Miller
Hi All, I'm having some trouble setting the disk spill directory for spark. The following approaches set "spark.local.dir" (according to the "Environment" tab of the web UI) but produce the indicated warnings: *In spark-env.sh:* export SPARK_JAVA_OPTS=-Dspark.local.dir=/spark/spill *Associated w
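
A sketch of the per-application form of the setting (the path is a placeholder); note that on 1.0-era standalone clusters SPARK_LOCAL_DIRS set in spark-env.sh on the workers overrides spark.local.dir:

    from pyspark import SparkConf, SparkContext

    # per-application shuffle/spill directory, shown as a sketch only
    conf = (SparkConf()
            .setAppName("spill-dir-demo")
            .set("spark.local.dir", "/spark/spill"))
    sc = SparkContext(conf=conf)
    print(conf.get("spark.local.dir"))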

SPARK_DRIVER_MEMORY

2014-08-14 Thread Brad Miller
Hi All, I have a Spark job for which I need to increase the amount of memory allocated to the driver to collect a large-ish (>200M) data structure. Formerly, I accomplished this by setting SPARK_MEM before invoking my job (which effectively set memory on the driver) and then setting spark.executor
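
A sketch of one way to pin the driver heap for a plain PySpark script; the value is a placeholder, and relying on the environment variable being picked up this way is an assumption about the 1.x launcher, not a guarantee:

    import os

    # The driver JVM's heap is fixed when it launches, so setting
    # spark.driver.memory on an already-running SparkContext is too late.
    # With spark-submit the usual choices are --driver-memory,
    # spark.driver.memory in spark-defaults.conf, or SPARK_DRIVER_MEMORY.
    os.environ.setdefault("SPARK_DRIVER_MEMORY", "8g")

    from pyspark import SparkContext

    sc = SparkContext(appName="large-collect")
    collected = sc.parallelize(range(10 ** 6)).collect()  # stand-in for the >200M structure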

Re: trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
Thanks Yin! best, -Brad On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai wrote: > Hi Brad, > > It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 > to track it. It will be fixed soon. > > Thanks, > > Yin > > > On Thu, Aug 7, 2014 at 10:55 AM,

trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
Hi All, I'm having a bit of trouble with nested data structures in pyspark with saveAsParquetFile. I'm running master (as of yesterday) with this pull request added: https://github.com/apache/spark/pull/1802. *# these all work* > sqlCtx.jsonRDD(sc.parallelize(['{"record": null}'])).saveAsParquet
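
A self-contained variant of the kind of snippet quoted above (a sketch: the nested records and the output path are placeholders, and it assumes a build that includes the nested-Parquet pull request mentioned in the message):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[2]", "parquet-nested")
    sqlCtx = SQLContext(sc)

    # nested structures: each record holds a struct with a scalar and an array
    srdd = sqlCtx.jsonRDD(sc.parallelize(['{"record": {"name": "x", "scores": [1, 2]}}',
                                          '{"record": {"name": "y", "scores": [3]}}']))
    srdd.saveAsParquetFile("/tmp/nested_demo.parquet")
    print(sqlCtx.parquetFile("/tmp/nested_demo.parquet").collect())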

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
back to the Python side, SchemaRDD#javaToPython > failed on your cases. I have created > https://issues.apache.org/jira/browse/SPARK-2875 to track it. > > Thanks, > > Yin > > > On Tue, Aug 5, 2014 at 9:20 PM, Brad Miller > wrote: > >> Hi All, >> >

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
> figure out the proper schema. We will take a look at it. > > Thanks, > > Yin > > > On Tue, Aug 5, 2014 at 12:20 PM, Brad Miller > wrote: > >> Assuming updating to master fixes the bug I was experiencing with jsonRDD >> and jsonFile, then pushing "sample"

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
bugs / adding features very rapidly compared with Spark core. > > branch-1.1 was just cut and is being QAed for a release, at this point its > likely the same as master, but that will change as features start getting > added to master in the coming weeks. > > > > On Tue, Aug 5

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
[4,5,6]]}', > '{"foo":[[1,2,3], [4,5,6]]}'])).printSchema() > root > |-- foo: array (nullable = true) > ||-- element: array (containsNull = false) > |||-- element: integer (containsNull = false) > > >>> > > Nick > > > > On

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
unlikely to be fixed in master either. I previously tried master as well, but ran into a build problem that did not occur with the 1.0 branch. Can anybody else verify that the second example still crashes (and is meant to work)? If so, would it be best to modify JIRA-2376 or start a new bug? https://iss

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
https://issues.apache.org/jira/browse/SPARK-2376 best, -brad On Tue, Aug 5, 2014 at 12:18 PM, Davies Liu wrote: > This "sample" argument of inferSchema is still not in master, I will > try to add it if it makes > sense. > > On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller > wrote: > > Hi Davi

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
top level dictionary is trying to solve this kind of > inconsistency. > > The Row class in pyspark.sql has a similar interface to dict, so you > can easily convert > your dict into a Row: > > ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d))) > > In order to get the correct s
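
A runnable sketch of the suggestion quoted above (assuming the 1.1-era API, where Row lives in pyspark.sql):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext("local[2]", "infer-schema-rows")
    sqlCtx = SQLContext(sc)

    rdd_of_dict = sc.parallelize([{"name": "alice", "age": 31},
                                  {"name": "bob", "age": 25}])
    # convert each dict to a Row so inferSchema sees a consistent record type
    srdd = sqlCtx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
    srdd.printSchema()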

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
on master or the 1.1-RC which > should be coming out this week. Pyspark did not have good support for > nested data previously. If you still encounter issues using a more recent > version, please file a JIRA. Thanks! > > > On Tue, Aug 5, 2014 at 11:55 AM, Brad Miller &

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
Got it. Thanks! On Tue, Aug 5, 2014 at 11:53 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Notice the difference in the schema. Are you running the 1.0.1 release, >> or a more bleeding-edge version from the repository? > > Yep, my bad. I’m running off master at commit > 184048f80

trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Hi All, I am interested to use jsonRDD and jsonFile to create a SchemaRDD out of some JSON data I have, but I've run into some instability involving the following java exception: An error occurred while calling o1326.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Ta

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
f doing so > from RDDs of Rows. > > I’m not sure what the intention behind this move is, but as a user I’d > like to be able to convert RDDs of dictionaries directly to SchemaRDDs with > the completeness of the jsonRDD()/jsonFile() methods. Right now if I > really want that,

pyspark inferSchema

2014-08-05 Thread Brad Miller
Hi All, I have a data set where each record is serialized using JSON, and I'm interested to use SchemaRDDs to work with the data. Unfortunately I've hit a snag since some fields in the data are maps and lists, and are not guaranteed to be populated for each record. This seems to cause inferSchema
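
The related jsonRDD/jsonFile threads above point at jsonRDD as the more complete path for records like these; a small sketch of how missing fields come back as nullable columns:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[2]", "sparse-json")
    sqlCtx = SQLContext(sc)

    records = ['{"user": "a", "tags": ["x", "y"]}',
               '{"user": "b"}',                        # "tags" missing here
               '{"user": "c", "attrs": {"k": "v"}}']   # "attrs" missing elsewhere
    srdd = sqlCtx.jsonRDD(sc.parallelize(records))
    srdd.printSchema()   # absent fields appear as nullable columns in the merged schema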

Re: Fwd: pyspark crash on mesos

2014-08-01 Thread Brad Miller
Hi Jia, Unfortunately, I did not ever find a solution. More recently, I tried running Spark 1.0.0 with (I think) Mesos 0.18.0; that likewise had stability issues. I've decided to make my peace with Standalone mode for now. -Brad On Thu, Jul 31, 2014 at 7:29 PM, daijia wrote: > I met the sam

Re: Read all the columns from a file in spark sql

2014-07-17 Thread Brad Miller
Hi Pandees, You may also be helped by looking into the ability to read and write Parquet files which is available in the present release. Parquet files allow you to store columnar data in HDFS. At present, Spark "infers" the schema from the Parquet file. In pyspark, some of the methods you'd be
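
A minimal sketch of the pyspark Parquet methods referred to above (paths and column names are placeholders):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[2]", "parquet-roundtrip")
    sqlCtx = SQLContext(sc)

    # write a SchemaRDD out as columnar Parquet, then read it back;
    # the schema travels with the Parquet file, which is the "inference" noted above
    srdd = sqlCtx.jsonRDD(sc.parallelize(['{"col_a": 1, "col_b": "x"}',
                                          '{"col_a": 2, "col_b": "y"}']))
    srdd.saveAsParquetFile("/tmp/parquet_demo")
    print(sqlCtx.parquetFile("/tmp/parquet_demo").collect())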

Re: Announcing Spark 1.0.1

2014-07-12 Thread Brad Miller
Hi All, Congrats to the entire Spark team on the 1.0.1 release. In checking out the new features, I noticed that it looks like the python API docs have been updated, but the header at the top of the page still says "Spark 1.0.0". Clearly not a big deal... I just wouldn't want anyone to g

odd caching behavior or accounting

2014-06-30 Thread Brad Miller
the archive. best, -Brad -- Forwarded message -- From: Brad Miller Date: Mon, Jun 30, 2014 at 10:20 AM Subject: odd caching behavior or accounting To: user@spark.apache.org Hi All, I've recently noticed some caching behavior which I did not understand and may or may not ha

pyspark bug with unittest and scikit-learn

2014-06-19 Thread Brad Miller
Hi All, I am attempting to develop some unit tests for a program using pyspark and scikit-learn and I've come across some weird behavior. I receive the following warning during some tests "python/pyspark/serializers.py:327: DeprecationWarning: integer argument expected, got float". Although it's
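
For context, a minimal shape of such a test (a sketch, not the test from the message; scikit-learn is omitted because the SparkContext handling is the interesting part):

    import unittest
    from pyspark import SparkContext

    class PipelineTest(unittest.TestCase):
        def setUp(self):
            # one local SparkContext per test keeps tests isolated
            self.sc = SparkContext("local[2]", "unit-test")

        def tearDown(self):
            self.sc.stop()

        def test_feature_counts(self):
            rdd = self.sc.parallelize([(i, float(i)) for i in range(100)])
            self.assertEqual(rdd.count(), 100)

    if __name__ == "__main__":
        unittest.main()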

Re: pyspark join crash

2014-06-04 Thread Brad Miller
https://issues.apache.org/jira/browse/SPARK-2021 to track > this — it’s something we’ve been meaning to look at soon. > > Matei > > On Jun 4, 2014, at 8:23 AM, Brad Miller > wrote: > > > Hi All, > > > > I have experienced some crashing behavior with join in pyspark.

pyspark join crash

2014-06-04 Thread Brad Miller
Hi All, I have experienced some crashing behavior with join in pyspark. When I attempt a join with 2000 partitions in the result, the join succeeds, but when I use only 200 partitions in the result, the join fails with the message "Job aborted due to stage failure: Master removed our application:
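
The knob being varied in this report is join's numPartitions argument; a toy sketch (sizes are tiny placeholders, not the workload from the message):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "join-partitions")

    left = sc.parallelize([(i % 1000, i) for i in range(10000)])
    right = sc.parallelize([(i % 1000, -i) for i in range(10000)])

    # the second argument controls how many partitions the joined RDD has;
    # the report above saw 2000 succeed where 200 crashed on a real workload
    joined = left.join(right, 2000)
    print(joined.count())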

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
> 4. Shuffle on disk > Is it true - I couldn't find it in official docs, but did see this mentioned > in various threads - that shuffle _always_ hits disk? (Disregarding OS > caches.) Why is this the case? Are you planning to add a function to do > shuffle in memory or are there some intrinsic reas

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
I would echo much of what Andrew has said. I manage a small/medium sized cluster (48 cores, 512G ram, 512G disk space dedicated to spark, data storage in separate HDFS shares). I've been using spark since 0.7, and as with Andrew I've observed significant and consistent improvements in stability (

Re: trouble with "join" on large RDDs

2014-04-09 Thread Brad Miller
python > Apr 8 11:19:19 bennett kernel: [86368.978386] Out of memory: Kill process > 15198 (python) score 221 or sacrifice child > Apr 8 11:19:19 bennett kernel: [86368.978389] Killed process 15198 (python) > total-vm:7381884kB, anon-rss:7273852kB, file-rss:0kB > > > On Tue, Apr 8

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Brad Miller
If you're using pyspark, beware that there are some known issues associated with large broadcast variables. https://spark-project.atlassian.net/browse/SPARK-1065 http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/browser -Brad On Wed, Mar 12, 2014 at 10:15 AM, Guillaume Pitel < gui

pyspark broadcast error

2014-03-10 Thread Brad Miller
Hi All, When I run the program shown below, I receive the error shown below. I am running the current version of branch-0.9 from github. Note that I do not receive the error when I replace "2 ** 29" with "2 ** X", where X < 29. More interestingly, I do not receive the error when X = 30, and when
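
A reconstruction of the kind of program the message describes (not the original; the payload here is deliberately small so the sketch runs anywhere):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-size")

    # the report above concerns payloads around 2 ** 29 bytes; a much smaller
    # value is used here so the sketch stays lightweight
    payload = bytearray(2 ** 20)
    b = sc.broadcast(payload)

    print(sc.parallelize(range(4)).map(lambda _: len(b.value)).collect())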