Hi All,
I am interested in calling collect() on a large RDD so that I can run a learning
algorithm on it. I've noticed that when I don't increase
SPARK_DRIVER_MEMORY I can run out of memory. I've also noticed that it
looks like the same fraction of memory is reserved for storage on the
driver as on the workers.
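For reference, a sketch of the settings involved (values are illustrative, not taken from this thread): in Spark 1.x the driver heap is sized with spark.driver.memory (or the older SPARK_DRIVER_MEMORY), and spark.storage.memoryFraction (default 0.6) controls how much of a JVM's heap is reserved for cached blocks.
# spark-defaults.conf -- a minimal sketch, assuming Spark 1.x; sizes are made up
spark.driver.memory            8g
# lowering the storage fraction leaves more of the driver heap free for
# the result of collect(); 0.6 is the default
spark.storage.memoryFraction   0.3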
ads(self, obj):
>
> OverflowError: size does not fit in an int
**BLOCK 10** [ERROR 1]
check_pre_serialized(30)
...same as above...
**BLOCK 11** [ERROR 3]
check_unserialized(30)
...same as above...
On Thu, Sep 25, 2014 at 2:55 PM, Davies Liu wrote:
>
> On Thu, Sep 25, 2014 at 11:2
>
> Arun
>
> On Fri, Sep 26, 2014 at 1:32 PM, Brad Miller
> wrote:
>
>> I've had multiple jobs crash due to "java.io.IOException: unexpected
>> exception type"; I've been running the 1.1 branch for some time and am now
>> running the 1.
I've had multiple jobs crash due to "java.io.IOException: unexpected
exception type"; I've been running the 1.1 branch for some time and am now
running the 1.1 release binaries. Note that I only use PySpark. I haven't
kept detailed notes or the tracebacks around since there are other problems
that
k said that the serialized closure cannot be parsed (base64)
> >> correctly by py4j.
> >>
> >> The string in Java cannot be longer than 2G, so the serialized closure
> >> cannot be longer than 1.5G (there is overhead in base64); is it possible
> >> that your dat
Hi All,
I'm experiencing a java.lang.NegativeArraySizeException in a pyspark script
I have. I've pasted the full traceback at the end of this email.
I have isolated the line of code in my script which "causes" the exception
to occur. Although the exception seems to occur deterministically, it is
Hi All,
I suspect I am experiencing a bug. I've noticed that while running
larger jobs, they occasionally die with the exception
"java.util.NoSuchElementException: key not found xyz", where "xyz"
denotes the ID of some particular task. I've excerpted the log from
one job that died in this way below.
Hi Andrew,
I agree with Nicholas. That was a nice, concise summary of the
meaning of the locality customization options, indicators and default
Spark behaviors. I haven't combed through the documentation
end-to-end in a while, but I'm also not sure that information is
presently represented somew
e next bugfix release, you can work around this by:
>
> srdd = sqlCtx.jsonRDD(rdd)
> srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, False, None), sqlCtx)
>
>
> On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller
> wrote:
> > Hi All,
> >
> > I'm having some trouble
Hi All,
I'm having some trouble with the coalesce and repartition functions for
SchemaRDD objects in pyspark. When I run:
sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}',
'{"foo":"baz"}'])).coalesce(1)
I get this error:
Py4JError: An error occurred while calling o94.coalesce. Trace:
py4j.Py4JEx
back a table in a SQLContext.
On Fri, Sep 5, 2014 at 9:53 AM, Benjamin Zaitlen wrote:
> Hi Brad,
>
> When you do the conversion is this a Hive/Spark job or is it a
> pre-processing step before loading into HDFS?
>
> ---Ben
>
>
> On Fri, Sep 5, 2014 at 10:29 AM, Br
My approach may be partly influenced by my limited experience with SQL and
Hive, but I just converted all my dates to seconds-since-epoch and then
selected samples from specific time ranges using integer comparisons.
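A minimal sketch of that approach (the field names, dates, and table name are invented; assumes an existing SparkContext `sc` and a Spark 1.1-era SQLContext):
import calendar, json, time
from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)

def to_epoch(date_str):
    # e.g. "2014-09-04" -> seconds since the epoch (UTC)
    return calendar.timegm(time.strptime(date_str, "%Y-%m-%d"))

records = sc.parallelize([json.dumps({"id": 1, "ts": to_epoch("2014-09-04")}),
                          json.dumps({"id": 2, "ts": to_epoch("2014-09-05")})])
srdd = sqlCtx.jsonRDD(records)
srdd.registerTempTable("events")

# select a sample from a specific time range with plain integer comparisons
start, end = to_epoch("2014-09-04"), to_epoch("2014-09-05")
sample = sqlCtx.sql("SELECT * FROM events WHERE ts >= %d AND ts < %d" % (start, end))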
On Thu, Sep 4, 2014 at 6:38 PM, Cheng, Hao wrote:
> There are 2 SQL dialects,
Hi All,
Yesterday I restarted my cluster, which had the effect of clearing /tmp.
When I brought Spark back up and ran my first job, /tmp/spark-events was
re-created and the job ran fine. I later learned that other users were
receiving errors when trying to create a spark context. It turned out
How did you specify the HDFS path? When I put
spark.eventLog.dir hdfs://crosby.research.intel-research.net:54310/tmp/spark-events
in my spark-defaults.conf file, I receive the following error:
An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext.
: java.io.IOEx
Hi All,
@Andrew
Thanks for the tips. I just built the master branch of Spark last
night, but am still having problems viewing history through the
standalone UI. I dug into the Spark job events directories as you
suggested, and I see at a minimum 'SPARK_VERSION_1.0.0' and
'EVENT_LOG_1'; for appli
Hi Andrew,
I'm running something close to the present master (I compiled several days
ago) but am having some trouble viewing history.
I set "spark.eventLog.dir" to true, but continually receive the error
message (via the web UI) "Application history not found...No event logs
found for applicatio
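For anyone hitting the same confusion: the boolean flag is spark.eventLog.enabled, while spark.eventLog.dir takes a path. A sketch of both entries in spark-defaults.conf (the path is illustrative, not from this thread):
spark.eventLog.enabled   true
spark.eventLog.dir       hdfs://namenode:54310/tmp/spark-events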
Hi All,
I'm having some trouble setting the disk spill directory for spark. The
following approaches set "spark.local.dir" (according to the "Environment"
tab of the web UI) but produce the indicated warnings:
*In spark-env.sh:*
export SPARK_JAVA_OPTS=-Dspark.local.dir=/spark/spill
*Associated w
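One route that avoids the deprecated SPARK_JAVA_OPTS is sketched below (same path as above; note that in Spark 1.x standalone mode the worker-side SPARK_LOCAL_DIRS setting can take precedence over the application's value):
# spark-defaults.conf (application side)
spark.local.dir    /spark/spill

# spark-env.sh on the workers (standalone mode)
export SPARK_LOCAL_DIRS=/spark/spill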
Hi All,
I have a Spark job for which I need to increase the amount of memory
allocated to the driver to collect a large-ish (>200M) data structure.
Formerly, I accomplished this by setting SPARK_MEM before invoking my
job (which effectively set memory on the driver) and then setting
spark.executor
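For comparison, a sketch of the spark-submit flags that replaced SPARK_MEM in Spark 1.x (the sizes and script name are made up):
spark-submit --driver-memory 8g --executor-memory 4g my_job.py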
Thanks Yin!
best,
-Brad
On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai wrote:
> Hi Brad,
>
> It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908
> to track it. It will be fixed soon.
>
> Thanks,
>
> Yin
>
>
> On Thu, Aug 7, 2014 at 10:55 AM,
Hi All,
I'm having a bit of trouble with nested data structures in pyspark with
saveAsParquetFile. I'm running master (as of yesterday) with this pull
request added: https://github.com/apache/spark/pull/1802.
*# these all work*
> sqlCtx.jsonRDD(sc.parallelize(['{"record":
null}'])).saveAsParquet
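Since the excerpt is cut off here, a self-contained sketch of the kind of call being exercised (the paths and the nested example are invented, not the original test cases; assumes an existing SparkContext `sc`):
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

# a flat record with a null field
flat = sqlCtx.jsonRDD(sc.parallelize(['{"record": null}', '{"record": "a"}']))
flat.saveAsParquetFile("/tmp/flat.parquet")

# a nested record of the sort this thread is about
nested = sqlCtx.jsonRDD(sc.parallelize(['{"record": {"vals": [1, 2, 3]}}']))
nested.saveAsParquetFile("/tmp/nested.parquet")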
ack to the Python side, SchemaRDD#javaToPython
> failed on your cases. I have created
> https://issues.apache.org/jira/browse/SPARK-2875 to track it.
>
> Thanks,
>
> Yin
>
>
> On Tue, Aug 5, 2014 at 9:20 PM, Brad Miller
> wrote:
>
>> Hi All,
>>
>>
> figure out the proper schema. We will take a look at it.
>
> Thanks,
>
> Yin
>
>
> On Tue, Aug 5, 2014 at 12:20 PM, Brad Miller
> wrote:
>
>> Assuming updating to master fixes the bug I was experiencing with jsonRDD
>> and jsonFile, then pushing "sample"
bugs / adding features very rapidly compared with Spark core.
>
> branch-1.1 was just cut and is being QAed for a release, at this point its
> likely the same as master, but that will change as features start getting
> added to master in the coming weeks.
>
>
>
> On Tue, Aug 5
[4,5,6]]}',
> '{"foo":[[1,2,3], [4,5,6]]}'])).printSchema()
> root
> |-- foo: array (nullable = true)
> |    |-- element: array (containsNull = false)
> |    |    |-- element: integer (containsNull = false)
>
> >>>
>
> Nick
>
>
>
> On
nlikely to be fixed in master either. I previously tried
master as well, but ran into a build problem that did not occur with the
1.0 branch.
Can anybody else verify that the second example still crashes (and is meant
to work)? If so, would it be best to modify SPARK-2376 or start a new bug?
https://iss
ra/browse/SPARK-2376
best,
-brad
On Tue, Aug 5, 2014 at 12:18 PM, Davies Liu wrote:
> This "sample" argument of inferSchema is still no in master, if will
> try to add it if it make
> sense.
>
> On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller
> wrote:
> > Hi Davi
top level dictionary is trying to solve this kind of
> inconsistency.
>
> The Row class in pyspark.sql has a similar interface to dict, so you
> can easily convert
> your dict into a Row:
>
> ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
>
> In order to get the correct s
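A runnable sketch of that dict-to-Row conversion (field names and values are invented; assumes an existing SparkContext `sc` and Spark 1.1-era pyspark, where pyspark.sql.Row is available):
from pyspark.sql import SQLContext, Row

sqlCtx = SQLContext(sc)
rdd_of_dict = sc.parallelize([{"name": "a", "count": 1},
                              {"name": "b", "count": 2}])
# wrap each dict in a Row so inferSchema sees a consistent record type
srdd = sqlCtx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
srdd.printSchema()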
on master or the 1.1-RC which
> should be coming out this week. Pyspark did not have good support for
> nested data previously. If you still encounter issues using a more recent
> version, please file a JIRA. Thanks!
>
>
> On Tue, Aug 5, 2014 at 11:55 AM, Brad Miller
&
Got it. Thanks!
On Tue, Aug 5, 2014 at 11:53 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:
> Notice the difference in the schema. Are you running the 1.0.1 release,
>> or a more bleeding-edge version from the repository?
>
> Yep, my bad. I’m running off master at commit
> 184048f80
Hi All,
I am interested in using jsonRDD and jsonFile to create a SchemaRDD out of
some JSON data I have, but I've run into some instability involving the
following java exception:
An error occurred while calling o1326.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Ta
f doing so
> from RDDs of Rows.
>
> I’m not sure what the intention behind this move is, but as a user I’d
> like to be able to convert RDDs of dictionaries directly to SchemaRDDs with
> the completeness of the jsonRDD()/jsonFile() methods. Right now if I
> really want that,
Hi All,
I have a data set where each record is serialized using JSON, and I'm
interested to use SchemaRDDs to work with the data. Unfortunately I've hit
a snag since some fields in the data are maps and lists, and are not
guaranteed to be populated for each record. This seems to cause
inferSchema
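A sketch of the kind of records described above (field names are invented; assumes an existing SparkContext `sc` and SQLContext `sqlCtx`). jsonRDD builds the schema by scanning the records and merging fields, so it copes with keys that are absent from some records, which is the snag described here with inferSchema:
records = sc.parallelize([
    '{"id": 1, "tags": ["a", "b"], "attrs": {"k": "v"}}',
    '{"id": 2}',  # the map and list fields are absent from this record
])
srdd = sqlCtx.jsonRDD(records)
srdd.printSchema()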
Hi Jia,
Unfortunately, I did not ever find a solution. More recently, I tried
running Spark 1.0.0 with (I think) Mesos 0.18.0; that likewise had
stability issues. I've decided to make my peace with Standalone mode for
now.
-Brad
On Thu, Jul 31, 2014 at 7:29 PM, daijia wrote:
> I met the sam
Hi Pandees,
You may also be helped by looking into the ability to read and write
Parquet files, which is available in the present release. Parquet
files allow you to store columnar data in HDFS. At present, Spark
"infers" the schema from the Parquet file. In pyspark, some of the
methods you'd be
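A sketch of the pyspark round-trip being referred to (the path and record are invented; assumes an existing SparkContext `sc` and a Spark 1.0.x SQLContext `sqlCtx`):
srdd = sqlCtx.jsonRDD(sc.parallelize(['{"user": "a", "score": 1}']))
srdd.saveAsParquetFile("hdfs:///tmp/example.parquet")   # columnar storage in HDFS

loaded = sqlCtx.parquetFile("hdfs:///tmp/example.parquet")
loaded.printSchema()  # the schema is read back from the Parquet file itself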
Hi All,
Congrats to the entire Spark team on the 1.0.1 release.
In checking out the new features, I noticed that it looks like the
python API docs have been updated, but the title and the header at
the top of the page still say "Spark 1.0.0". Clearly not a big
deal... I just wouldn't want anyone to g
he archive.
best,
-Brad
-- Forwarded message --
From: Brad Miller
Date: Mon, Jun 30, 2014 at 10:20 AM
Subject: odd caching behavior or accounting
To: user@spark.apache.org
Hi All,
I've recently noticed some caching behavior which I did not understand
and may or may not ha
Hi All,
I am attempting to develop some unit tests for a program using pyspark and
scikit-learn and I've come across some weird behavior. I receive the
following warning during some tests "python/pyspark/serializers.py:327:
DeprecationWarning: integer argument expected, got float".
Although it's
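A minimal sketch of the local-mode test setup being described (the tested logic is a placeholder, not the original scikit-learn code):
import unittest
from pyspark import SparkContext

class SimplePysparkTest(unittest.TestCase):
    def setUp(self):
        # a small local context keeps each test self-contained
        self.sc = SparkContext("local[2]", "unit-test")

    def tearDown(self):
        self.sc.stop()

    def test_sum(self):
        # 0 + 1 + ... + 9 == 45
        self.assertEqual(self.sc.parallelize(range(10)).sum(), 45)

if __name__ == "__main__":
    unittest.main()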
/issues.apache.org/jira/browse/SPARK-2021 to track
> this — it’s something we’ve been meaning to look at soon.
>
> Matei
>
> On Jun 4, 2014, at 8:23 AM, Brad Miller
> wrote:
>
> > Hi All,
> >
> > I have experienced some crashing behavior with join in pyspark.
Hi All,
I have experienced some crashing behavior with join in pyspark. When I
attempt a join with 2000 partitions in the result, the join succeeds, but
when I use only 200 partitions in the result, the join fails with the
message "Job aborted due to stage failure: Master removed our application:
> 4. Shuffle on disk
> Is it true - I couldn't find it in official docs, but did see this mentioned
> in various threads - that shuffle _always_ hits disk? (Disregarding OS
> caches.) Why is this the case? Are you planning to add a function to do
> shuffle in memory or are there some intrinsic reas
I would echo much of what Andrew has said.
I manage a small/medium sized cluster (48 cores, 512G ram, 512G disk
space dedicated to spark, data storage in separate HDFS shares). I've
been using spark since 0.7, and as with Andrew I've observed
significant and consistent improvements in stability (
python
> Apr 8 11:19:19 bennett kernel: [86368.978386] Out of memory: Kill process
> 15198 (python) score 221 or sacrifice child
> Apr 8 11:19:19 bennett kernel: [86368.978389] Killed process 15198 (python)
> total-vm:7381884kB, anon-rss:7273852kB, file-rss:0kB
>
>
> On Tue, Apr 8
If you're using pyspark, beware that there are some known issues associated
with large broadcast variables.
https://spark-project.atlassian.net/browse/SPARK-1065
http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/browser
-Brad
On Wed, Mar 12, 2014 at 10:15 AM, Guillaume Pitel <
gui
Hi All,
When I run the program shown below, I receive the error shown below.
I am running the current version of branch-0.9 from github. Note that
I do not receive the error when I replace "2 ** 29" with "2 ** X",
where X < 29. More interestingly, I do not receive the error when X =
30, and when