Re: distinct on huge dataset

2014-04-17 Thread Mayur Rustagi
Preferably increase the ulimit on your machines. Spark needs to access a lot of
small files, so it is hard to keep the number of open file handles down.
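
As a rough illustration of why the handle count explodes (a sketch assuming the
pre-consolidation hash shuffle, where every concurrently running map task keeps
one shuffle file open per reduce partition; the numbers are placeholders):

  open files per worker ≈ concurrent map tasks × reduce partitions
  e.g. 16 cores × 2000 reduce partitions = 32000 descriptors

So either raise the ulimit or shrink one of the two factors.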


—
Sent from Mailbox


Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Btw, I've got System.setProperty("spark.shuffle.consolidate.files",
"true") and use ext3 (CentOS...)



Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Does this continue in newer versions? (I'm on 0.8.0 now)

When I use .distinct() on moderately large datasets (224GB, 8.5B rows,
I'm guessing about 500M are distinct) my jobs fail with:

14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to
java.io.FileNotFoundException
java.io.FileNotFoundException:
/tmp/spark-local-20140417145643-a055/3c/shuffle_1_218_1157 (Too many
open files)

ulimit -n tells me I can open 32000 files. Here's a plot of lsof on a
worker node during a failed .distinct():
http://i.imgur.com/wyBHmzz.png ; you can see tasks fail once Spark
tries to open 32000 files.

I never ran into this in 0.7.3. Is there a parameter I can set to tell
Spark to use fewer than 32000 files?
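
One knob that may help: distinct() accepts an explicit partition count, which
bounds the reduce side of the shuffle and therefore the number of shuffle files
each map task writes. A hedged sketch, assuming an existing SparkContext sc and
a placeholder partition count (fewer partitions means fewer files, but more
data per reduce task):

val data = sc.textFile("/user/hdfs/dump.csv")
// distinct(numPartitions) sets how many reduce partitions the shuffle uses
val distinctCount = data.distinct(512).count()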



Re: distinct on huge dataset

2014-03-24 Thread Aaron Davidson
Look up setting ulimit, though note the distinction between soft and hard
limits, and that updating your hard limit may require
changing /etc/security/limits.conf and restarting each worker.
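
For example, a sketch of the relevant pieces; the user name and value are
placeholders, and the workers have to be restarted from a session that
actually sees the new limit:

# /etc/security/limits.conf
sparkuser  soft  nofile  65536
sparkuser  hard  nofile  65536

# verify from the account that launches the workers
ulimit -Sn   # soft limit
ulimit -Hn   # hard limit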



Re: distinct on huge dataset

2014-03-24 Thread Kane
Got a bit further: I think the out-of-memory error was caused by setting
spark.spill to false. Now I have this error; is there an easy way to
increase the file limit for Spark, cluster-wide?

java.io.FileNotFoundException: /tmp/spark-local-20140324074221-b8f1/01/temp_1ab674f9-4556-4239-9f21-688dfc9f17d2 (Too many open files)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:192)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:113)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:191)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:141)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3084.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
Ah, interesting. count() without distinct is streaming and does not require
that a single partition fit in memory, for instance. That said, the
behavior may change if you increase the number of partitions in your input
RDD by using RDD.repartition().
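
A sketch of both points, assuming an existing SparkContext sc and a placeholder
partition count. Roughly, distinct() expands into a map to key/value pairs
followed by reduceByKey, which is where the per-partition memory pressure comes
from, so more and smaller partitions reduce what each reduce task must hold or
spill:

val t = sc.textFile("/user/hdfs/dump.csv")
// distinct() is essentially:
//   t.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)
// Repartitioning the input also raises the default reduce-side partition count.
t.repartition(4096).distinct().count()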



Re: distinct on huge dataset

2014-03-23 Thread Kane
Yes, there was an error in the data; after fixing it, count fails with an
Out of Memory error.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3051.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
Andrew, this should be fixed in 0.9.1, assuming it is the same hash
collision error we found there.

Kane, is it possible your bigger data is corrupt, such that any
operations on it fail?
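
One hedged way to check from the shell, assuming the job splits each line on a
delimiter and then indexes field 1 (which is what the
ArrayIndexOutOfBoundsException: 1 in the original trace suggests; the comma
delimiter is a guess, and sc is the shell's SparkContext):

val raw = sc.textFile("/user/hdfs/dump.csv")
// lines that would break a field-1 lookup after splitting
val malformed = raw.filter(_.split(",").length < 2)
malformed.take(5).foreach(println)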



Re: distinct on huge dataset

2014-03-22 Thread Andrew Ash
FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and
have it disabled now. The specific error behavior was that a join would
consistently return one count of rows with spill enabled and another count
with it disabled.

Sent from my mobile phone


Re: distinct on huge dataset

2014-03-22 Thread Kane
But I was wrong - map also fails on the big file, and setting
spark.shuffle.spill doesn't help. Map fails with the same error.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, that helped; at least it was able to advance a bit further.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3038.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-22 Thread Aaron Davidson
This could be related to the hash collision bug in ExternalAppendOnlyMap in
0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045

You might try setting spark.shuffle.spill to false and see if that runs any
longer (turning off shuffle spill is dangerous, though, as it may cause
Spark to OOM if your reduce partitions are too large).
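
A minimal sketch of that experiment in the 0.9.0 style, where system properties
are read when the SparkContext is created; the master URL and app name are
placeholders, and as noted above this is for diagnosis rather than a fix:

import org.apache.spark.SparkContext

// Diagnosis only: with spill disabled, large reduce partitions can OOM executors.
System.setProperty("spark.shuffle.spill", "false")
val sc = new SparkContext("spark://master:7077", "spill-off-test")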



Re: distinct on huge dataset

2014-03-22 Thread Kane
I mean everything works with the small file. With the huge file, only count
and map work; distinct doesn't.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3034.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, it works with a smaller file; it can count and map, but not distinct.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3033.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-22 Thread Mayur Rustagi
Does it work on a smaller file?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



Re: distinct on huge dataset

2014-03-22 Thread Ryan Compton
Does it work without .distinct() ?

Possibly related issue I ran into:
https://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMgYSQ-3YNwD=veb1ct9jro_jetj40rj5ce_8exgsrhm7jb...@mail.gmail.com%3E


Re: distinct on huge dataset

2014-03-22 Thread Kane
It's 0.9.0



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3027.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-21 Thread Aaron Davidson
Which version of Spark are you running?



distinct on huge dataset

2014-03-21 Thread Kane
I have a huge 2.5TB file. When I run:
val t = sc.textFile("/user/hdfs/dump.csv")
t.distinct.count

It fails right away with a lot of:

Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 1
        at $line59.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:18)
        at $line59.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:16)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:46)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:45)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.