Re: ORC file writing hangs in pyspark

2016-02-23 Thread Zhan Zhang
Hi James, You can try writing with another format, e.g., Parquet, to see whether it is an ORC-specific issue or a more generic one. Thanks. Zhan Zhang On Feb 23, 2016, at 6:05 AM, James Barney wrote: I'm trying to write an ORC file
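
A minimal PySpark sketch of that isolation test (the DataFrame name and output paths are assumptions; on Spark 1.5 the ORC writer also requires a HiveContext):

    # Write the same DataFrame in both formats to see which writer hangs.
    df.write.format("parquet").save("hdfs:///tmp/fpgrowth_parquet")
    df.write.format("orc").save("hdfs:///tmp/fpgrowth_orc")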

Re: ORC file writing hangs in pyspark

2016-02-23 Thread Jeff Zhang
Have you checked the live Spark UI and YARN app logs? On Tue, Feb 23, 2016 at 10:05 PM, James Barney wrote: > I'm trying to write an ORC file after running the FPGrowth algorithm on a > dataset of around just 2GB in size. The algorithm performs well and can > display

Spark Job on YARN Hogging the entire Cluster resource

2016-02-23 Thread Prabhu Joseph
Hi All, A YARN cluster with 352 nodes (10TB memory, 3000 cores) runs the Fair Scheduler with 230 queues under the root queue. Each queue is configured with maxResources equal to the total cluster resources. When a Spark job is submitted into a queue A, it is given 10TB, 3000 cores according to
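
For reference, a sketch of the kind of fair-scheduler.xml entry involved; the queue name and caps here are illustrative, not the cluster's actual settings, but setting maxResources below the full cluster total is the usual way to keep one queue from being handed the entire 10TB / 3000 cores:

    <allocations>
      <queue name="A">
        <!-- illustrative cap, well below the cluster total -->
        <maxResources>2000000 mb,600 vcores</maxResources>
      </queue>
    </allocations>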

Re: Using Encoding to reduce GraphX's static graph memory consumption

2016-02-23 Thread Joseph E. Gonzalez
Actually, another improvement would be to use something like compressed sparse row encoding, which can be used to store A and A^T relatively efficiently (I think using 5 arrays instead of 6). There is an option to also be more cache aware using something like a block compressed sparse row
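
A toy sketch of the CSR idea in plain Python (the arrays describe an assumed 3-vertex example, not GraphX's actual internals):

    # offsets[v]..offsets[v+1] delimit vertex v's out-edges in the flat dst array.
    offsets = [0, 2, 3, 5]     # length = numVertices + 1
    dst = [1, 2, 2, 0, 1]      # destination vertex ids, grouped by source

    def out_neighbors(v):
        return dst[offsets[v]:offsets[v + 1]]

    assert out_neighbors(0) == [1, 2]  # vertex 0 points to vertices 1 and 2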

Fwd: HANA data access from SPARK

2016-02-23 Thread Dushyant Rajput
Hi, I am writing a Python app to load data from SAP HANA. dfr = DataFrameReader(sqlContext) df = dfr.jdbc(url='jdbc:sap://ip_hana:30015/?user==',table=table) df.show() It throws a serialization error: py4j.protocol.Py4JJavaError: An error occurred while calling o59.showString. :
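
A hedged PySpark sketch of the same read; the credentials, table name, and driver class are assumptions, and the HANA JDBC jar (ngdbc.jar) must be on both the driver and executor classpaths (e.g. via --jars):

    # Sketch only: substitute real connection details.
    df = sqlContext.read.jdbc(
        url="jdbc:sap://ip_hana:30015",
        table="SCHEMA.MY_TABLE",
        properties={"user": "USER",
                    "password": "PASS",
                    "driver": "com.sap.db.jdbc.Driver"})  # assumed HANA driver class
    df.show()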

Re: Using Encoding to reduce GraphX's static graph memory consumption

2016-02-23 Thread Adnan Haider
Hi, I have created a JIRA for this issue here. As for the pull request, my implementation is based on removing localSrcIds and storing an array of offsets into

Modify text in spark-packages

2016-02-23 Thread Sergio Ramírez
Hello, I am having trouble modifying the description of some of my packages on spark-packages.com. I haven't been able to change anything. I've written to the e-mail address in charge of managing this page, but I got no answer. Any clue? Thanks

ORC file writing hangs in pyspark

2016-02-23 Thread James Barney
I'm trying to write an ORC file after running the FPGrowth algorithm on a dataset of around just 2GB in size. The algorithm performs well and can display results if I take(n) the freqItemsets() of the result after converting it to a DF. I'm using Spark 1.5.2 on HDP 2.3.4 and Python 3.4.2 on
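
A hedged sketch of the pipeline described; the input path, minSupport, and partition count are assumptions, and on Spark 1.5 writing ORC requires a HiveContext:

    from pyspark.mllib.fpm import FPGrowth
    from pyspark.sql import Row

    transactions = sc.textFile("hdfs:///data/baskets.txt") \
                     .map(lambda line: line.strip().split(" "))
    model = FPGrowth.train(transactions, minSupport=0.01, numPartitions=32)
    df = model.freqItemsets() \
              .map(lambda fi: Row(items=fi.items, freq=fi.freq)) \
              .toDF()
    df.write.format("orc").save("hdfs:///out/freq_itemsets")  # the step that reportedly hangs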

Re: Accessing Web UI

2016-02-23 Thread Vasanth Bhat
Hi Gourav, The Spark version is spark-1.6.0-bin-hadoop2.6. The Java version is JDK 8. I have also tried JDK 7, but the results are the same. Thanks Vasanth On Tue, Feb 23, 2016 at 2:57 PM, Gourav Sengupta wrote: > Hi, > > This should really

Re: Opening a JIRA for QuantileDiscretizer bug

2016-02-23 Thread Sean Owen
Good catch, though it's probably very slightly simpler to write math.min(requiredSamples.toDouble ... Perhaps make sure you're logged in to JIRA. If you have any trouble, I'll open it for you. You can file it as a minor bug against ML. This is how you open a PR and everything else

Re: Accessing Web UI

2016-02-23 Thread Vasanth Bhat
Hi, Is there a way to provide minThreads and maxThreads for the thread pool through jetty.xml for the Jetty that is used by the Spark Web UI? I am hitting an issue very similar to the issue described in

Re: spark core api vs. google cloud dataflow

2016-02-23 Thread Reynold Xin
That's just the transform function in DataFrame:

    /**
     * Concise syntax for chaining custom transformations.
     * {{{
     *   def featurize(ds: DataFrame) = ...
     *
     *   df
     *     .transform(featurize)
     *     .transform(...)
     * }}}
     * @since 1.6.0
     */
    def transform[U](t: DataFrame
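
In case it helps readers coming from PySpark: DataFrame.transform is a Scala API in 1.6, but the same chaining reads naturally in Python as plain function application. A minimal sketch with hypothetical featurize/normalize steps (the column names are assumptions):

    from pyspark.sql.functions import size

    # Hypothetical whole-DataFrame transformations, analogous to
    # Dataflow's apply() on a PCollection.
    def featurize(df):
        return df.withColumn("n_items", size(df["items"]))  # assumes an array column "items"

    def normalize(df):
        return df.filter(df["n_items"] > 0)

    result = normalize(featurize(df))  # same effect as df.transform(featurize).transform(normalize)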

spark core api vs. google cloud dataflow

2016-02-23 Thread lonely Feb
Google Cloud Dataflow provides a distributed dataset called PCollection, and syntactic sugar based on PCollection is provided in the form of "apply". Note that "apply" is different from the Spark API "map", which passes each element of the source through a function func. I wonder whether Spark can support