Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-10-06 Thread amarouni
You can get some more insights by using the Spark history server
(http://spark.apache.org/docs/latest/monitoring.html); it can show you
which task is failing and other information that might help you debug
the issue.


On 05/10/2016 19:00, Babak Alipour wrote:
> The issue seems to lie in the RangePartitioner trying to create equal
> ranges. [1]
>
> [1]
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
> 
>  
>
>  The /Double/ values I'm trying to sort are mostly in the range [0,1]
> (~70% of the data, which roughly equates to 1 billion records); other
> numbers in the dataset go as high as 2000. With the RangePartitioner
> trying to create equal ranges, some tasks become almost empty while
> others are extremely large, due to the heavily skewed distribution.
>
> This is either a bug in Apache Spark or a major limitation of the
> framework. Has anyone else encountered this?
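One workaround that is sometimes used for sorts on heavily tied keys (a rough
sketch, not an official fix; "value" and "salt" are made-up column names): add
a random tiebreaker to the sort key, so the range partitioner can split runs
of identical values across partitions instead of piling them into one.

  import org.apache.spark.sql.functions.rand

  // Sketch: sort on (value, salt) so rows sharing the same "value"
  // can fall into different range partitions.
  val salted = df.withColumn("salt", rand())
  val sorted = salted.orderBy("value", "salt").drop("salt")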
>
> */Babak Alipour ,/*
> */University of Florida/*
>
> On Sun, Oct 2, 2016 at 1:38 PM, Babak Alipour wrote:
>
> Thanks Vadim for sharing your experience, but I have tried a
> multi-JVM setup (2 workers) and various sizes for
> spark.executor.memory (8g, 16g, 20g, 32g, 64g) and
> spark.executor.cores (2-4), with the same error all along.
>
> As for the files, these are all .snappy.parquet files, resulting
> from inserting some data from other tables. None of them actually
> exceeds 25MiB (I'm not sure why it is exactly this number). Setting
> the DataFrame to persist using StorageLevel.MEMORY_ONLY shows a size
> in memory of ~10g. I still cannot understand why it is trying to
> create such a big page when sorting. The entire column (this df has
> only 1 column) is not that big, and neither are the original files.
> Any ideas?
>
>
> Babak
>
>
>
> */Babak Alipour ,/*
> */University of Florida/*
>
> On Sun, Oct 2, 2016 at 1:45 AM, Vadim Semenov wrote:
>
> oh, and try to run even smaller executors, i.e. with
> `spark.executor.memory` <= 16GiB. I wonder what result you're
> going to get.
>
> On Sun, Oct 2, 2016 at 1:24 AM, Vadim Semenov wrote:
>
> > Do you mean running a multi-JVM 'cluster' on the single
> machine? 
> Yes, that's what I suggested.
>
> You can get some information here: 
> 
> http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
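(For what it's worth, the kind of setup being suggested — several smaller
executors instead of one large JVM — boils down to something like the
following configuration; the numbers are only placeholders, and
spark.executor.instances applies to YARN deployments:)

  val conf = new org.apache.spark.SparkConf()
    .set("spark.executor.memory", "16g")   // placeholder values
    .set("spark.executor.cores", "2")
    .set("spark.executor.instances", "4")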
> 
> 
>
> > How would that affect performance/memory-consumption? If
> a multi-JVM setup can handle such a large input, then why
> can't a single-JVM break down the job into smaller tasks?
> I don't have an answer to these questions; it requires an
> understanding of the internals of Spark, the JVM, and your setup.
>
> I ran into the same issue only once, when I tried to read a
> gzipped file whose size was >16GiB. That's the only time I ever hit
> this:
> https://github.com/apache/spark/blob/5d84c7fd83502aeb551d46a740502db4862508fe/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L238-L243
> 
> In the end I had to recompress my file into bzip2, which is
> splittable, to be able to read it with Spark.
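(Side note on the splittability point: one quick way to see whether an input
is being split is to check the partition count right after reading — a .gz
file typically comes back as a single partition, so one task has to hold the
whole decompressed file. A sketch, with a made-up path:)

  // gzip is not splittable, so this usually reports 1 partition
  val rdd = sc.textFile("/data/big-file.gz")
  println(rdd.getNumPartitions)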
>
>
> I'd look into the size of your files, and if they're huge I'd try
> to connect the error you got to the size of the files (though it
> seems strange to me, as the block size of a Parquet file is 128MiB).
> I don't have any other suggestions, I'm sorry.
>
>
> On Sat, Oct 1, 2016 at 11:35 PM, Babak Alipour wrote:
>
> Do you mean running a multi-JVM 'cluster' on the
> single machine? How would that affect
> performance/memory-consumption? If a multi-JVM setup
> can handle such a large input, then why can't a
> single-JVM break down the job into smaller tasks?
>
> I also found that SPARK-9411 mentions making the page size
> configurable, but it's hard-limited to ((1L << 31) - 1) * 8L [1]
>
> [1]
> 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
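(For what it's worth, that hard limit is exactly the number from the error
message in the subject, which can be checked in a Scala REPL:)

  scala> ((1L << 31) - 1) * 8L
  res0: Long = 17179869176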
> 
> 

Spark ML Interaction

2016-03-08 Thread amarouni
Hi,

Did anyone here manage to write an example of the following ML feature
transformer?
http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/Interaction.html
It's not documented on the official Spark ML features guide, but it can
be found in the package API javadocs.
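
For reference, here is roughly what I would expect the usage to look like,
pieced together from the javadocs (untested; column names and data are made
up, and I'm assuming the spark-shell sqlContext):

  import org.apache.spark.ml.feature.{Interaction, VectorAssembler}

  // Two toy rows with three numeric columns.
  val df = sqlContext.createDataFrame(Seq(
    (1.0, 2.0, 3.0),
    (4.0, 5.0, 6.0)
  )).toDF("a", "b", "c")

  // Assemble b and c into a vector, then interact it with a:
  // the output vector should contain the products a*b and a*c.
  val assembler = new VectorAssembler()
    .setInputCols(Array("b", "c"))
    .setOutputCol("bc")

  val interaction = new Interaction()
    .setInputCols(Array("a", "bc"))
    .setOutputCol("a_x_bc")

  interaction.transform(assembler.transform(df)).show(false)

Is this how others are using it?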

Thanks,




Dynamic jar loading

2015-12-17 Thread amarouni
Hello guys,

Do you know if the method SparkContext.addJar("file:///...") can be used
on a running context (an already started spark-shell)?
And if so, does it add the jar to the classpath of the Spark workers
(YARN containers in the case of yarn-client)?
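
For context, what I have in mind is roughly this, on an already running
shell (the path is just an example):

  // Register the jar with the running SparkContext so that tasks
  // submitted from now on can fetch it:
  sc.addJar("file:///home/user/libs/my-udfs.jar")

My understanding from the docs is that executors download such a jar for
subsequent tasks, but I'm not sure whether it also ends up on the classpath
of the containers themselves, hence the question.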

Thanks,




Re: Database does not exist: (Spark-SQL ===> Hive)

2015-12-15 Thread amarouni
Can you test with the latest version of Spark? I had the same issue
with 1.3 and it was resolved in 1.5.
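
Also worth double-checking Jeff's point below: if hive-site.xml is not on the
driver's classpath, the HiveContext falls back to a local (Derby) metastore,
in which case a database created in your real Hive metastore won't be
visible. A quick sanity check, roughly:

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  hiveContext.sql("SHOW DATABASES").show()  // should list test_db
  hiveContext.sql("USE test_db")
  hiveContext.sql("SHOW TABLES").show()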

On 15/12/2015 04:31, Jeff Zhang wrote:
> Do you put hive-site.xml on the classpath ?
>
> On Tue, Dec 15, 2015 at 11:14 AM, Gokula Krishnan D wrote:
>
> Hello All - 
>
>
> I tried to execute a Spark-Scala program in order to create a
> table in Hive and faced a couple of errors, so I just tried to execute
> "show tables" and "show databases".
>
> And I have already created a database named "test_db", but I have
> encountered the error "Database does not exist".
>
> *Note: I did see a couple of posts related to this error but nothing
> was helpful for me.*
>
> 
> =
> name := "ExploreSBT_V1"
>
> version := "1.0"
>
> scalaVersion := "2.11.5"
>
> libraryDependencies ++= Seq(
>   "org.apache.spark" %% "spark-core" % "1.3.0",
>   "org.apache.spark" %% "spark-sql"  % "1.3.0")
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.3.0"
> 
> =
>
> Error: Encountered the following exceptions:
> org.apache.spark.sql.execution.QueryExecutionException: FAILED:
> Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask. Database does not exist:
> test_db
> 15/12/14 18:49:57 ERROR HiveContext: 
> ==
> HIVE FAILURE OUTPUT
> ==
>  OK
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask. Database does not exist:
> test_db
>
> ==
> END HIVE FAILURE OUTPUT
> ==
>   
>
> Process finished with exit code 0
>
> Thanks & Regards, 
> Gokula Krishnan*(Gokul)*
>
>
>
>
> -- 
> Best Regards
>
> Jeff Zhang



Re: Save RandomForest Model from ML package

2015-10-23 Thread amarouni

It's an open issue: https://issues.apache.org/jira/browse/SPARK-4587

That being said, you can work around the issue by serializing the model
(simple Java serialization) and then restoring it before running the
prediction job.
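
Something along these lines (a rough sketch of what I mean; the paths are
made up and you'd want proper error handling):

  import java.io._
  import org.apache.spark.ml.classification.RandomForestClassificationModel

  // Save the fitted model with plain Java serialization.
  def saveModel(model: RandomForestClassificationModel, path: String): Unit = {
    val oos = new ObjectOutputStream(new FileOutputStream(path))
    try oos.writeObject(model) finally oos.close()
  }

  // Restore it later, before running the prediction job.
  def loadModel(path: String): RandomForestClassificationModel = {
    val ois = new ObjectInputStream(new FileInputStream(path))
    try ois.readObject().asInstanceOf[RandomForestClassificationModel] finally ois.close()
  }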

Best Regards,

On 22/10/2015 14:33, Sebastian Kuepers wrote:
> Hey,
>
> I'm trying to figure out the best practice for saving and loading
> models which have been fitted with the ML package - i.e. with the
> RandomForest classifier.
>
> There is PMML support in the MLlib package afaik but not in ML - is
> that correct?
>
> How do you approach this, so that you do not have to fit your model
> before every prediction job?
>
> Thanks,
> Sebastian
>
>
> Sebastian Küpers
> Account Director
>
> Publicis Pixelpark
> Leibnizstrasse 65, 10629 Berlin
> T +49 30 5058 1838
> M +49 172 389 28 52
> sebastian.kuep...@publicispixelpark.de
> Web: publicispixelpark.de, Twitter: @pubpxp
> Facebook: publicispixelpark.de/facebook
> Publicis Pixelpark - eine Marke der Pixelpark AG
> Vorstand: Horst Wagner (Vorsitzender), Dirk Kedrowitsch
> Aufsichtsratsvorsitzender: Pedro Simko
> Amtsgericht Charlottenburg: HRB 72163
>
>
>
>
>