Re: SparkSQL with large result size

2016-05-10 Thread Buntu Dev
case? > Regards, > Gourav > On Mon, May 2, 2016 at 5:59 PM, Ted Yu wrote: >> That's my interpretation. >> On Mon, May 2, 2016 at 9:45 AM, Buntu Dev <buntu...@gmail.com> wrote: >>> Thanks Ted, I thought the avg. block s

Re: pyspark dataframe sort issue

2016-05-08 Thread Buntu Dev
On Sat, May 7, 2016 at 11:48 PM, Buntu Dev wrote: > I'm using pyspark dataframe api to sort by specific column and then saving the dataframe as parquet file. But the resulting parquet file doesn't seem to be sorted. > Applying sort a

pyspark dataframe sort issue

2016-05-07 Thread Buntu Dev
I'm using the pyspark DataFrame API to sort by a specific column and then save the dataframe as a parquet file, but the resulting parquet file doesn't seem to be sorted. Applying the sort and doing a head() on the result shows the rows correctly sorted by the 'value' column in descending order, as shown below: ~~~
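For comparison, a minimal Scala sketch of the same sort-then-write flow (Spark 1.6-era API; only the 'value' column name follows the thread, the paths are made up). Row order across part files is not guaranteed when a parquet directory is read back, so a global sort generally has to be re-applied after the read:

~~~
// Sketch only: "value" comes from the thread, everything else is illustrative.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.desc

val sqlContext = new SQLContext(sc)          // sc: the spark-shell SparkContext
val df = sqlContext.read.parquet("input.parquet")

// Global sort, then write; each task writes its own part file.
df.orderBy(desc("value")).write.parquet("sorted.parquet")

// Reading the directory back and re-sorting restores the global order.
val back = sqlContext.read.parquet("sorted.parquet").orderBy(desc("value"))
~~~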

Re: SparkSQL with large result size

2016-05-02 Thread Buntu Dev
On Mon, May 2, 2016 at 6:21 AM, Ted Yu wrote: > Please consider decreasing block size. > Thanks >> On May 1, 2016, at 9:19 PM, Buntu Dev wrote: >> I got a 10g limitation on the executors and operating on parquet dataset with block size 70M with 200 blocks.

SparkSQL with large result size

2016-05-01 Thread Buntu Dev
I have a 10g limit on the executors and am operating on a parquet dataset with a 70M block size and 200 blocks. I keep hitting the memory limits when doing a 'select * from t1 order by c1 limit 1000000' (i.e., 1M rows). It works if I limit to, say, 100k. What are the options to save a large dataset without ru
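A minimal sketch of saving the query result to files instead of collecting it back through the driver (Spark 1.6-era API; the table and column names follow the thread, the paths and the registration step are assumptions):

~~~
// Sketch only: assumes the dataset has been loaded and registered as "t1".
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.read.parquet("t1.parquet").registerTempTable("t1")

val top = sqlContext.sql("SELECT * FROM t1 ORDER BY c1 LIMIT 1000000")
top.write.parquet("t1_top1m.parquet")   // the result is written out rather than collected to the driver
~~~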

Re: Dataframe fails for large resultsize

2016-04-29 Thread Buntu Dev
at 6:01 PM, Krishna wrote: > I recently encountered similar network-related errors and was able to fix it by applying the ethtool updates described here [https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-5085] > On Friday, April 29, 2016, Buntu Dev wrote

Re: Dataframe fails for large resultsize

2016-04-29 Thread Buntu Dev
e error I would ultimately want to store the result set as parquet. Are there any other options to handle this? Thanks! On Wed, Apr 27, 2016 at 11:10 AM, Buntu Dev wrote: > I got 14GB of parquet data and when trying to apply order by using spark sql and save the first 1M rows but ke

Dataframe fails for large resultsize

2016-04-27 Thread Buntu Dev
I have 14GB of parquet data. When I try to apply an order by using Spark SQL and save the first 1M rows, it keeps failing with "Connection reset by peer: socket write error" on the executors. I've allocated about 10g to both the driver and the executors and set maxResultSize to 10g, but
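For reference, a sketch of where those settings would go (the values echo the thread; how they are applied is an assumption, and spark.driver.memory in particular is only honored if set before the driver JVM starts, e.g. via spark-submit):

~~~
// Illustrative configuration only -- must be in place before the SparkContext exists.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("order-by-top-1m")
  .set("spark.executor.memory", "10g")
  .set("spark.driver.maxResultSize", "10g")
  // spark.driver.memory is read at JVM launch, so pass --driver-memory 10g to spark-submit instead

val sc = new SparkContext(conf)
~~~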

Re: How to estimate the size of dataframe using pyspark?

2016-04-12 Thread Buntu Dev
reproduce it (could generate fake dataset)? > On Sat, Apr 9, 2016 at 4:33 PM, Buntu Dev wrote: > > I've allocated about 4g for the driver. For the count stage, I notice the Shuffle Write to be 13.9 GB. > > On Sat, Apr 9, 2016 at 11:43 AM, Ndjido Ardo

Re: Graphframes pattern causing java heap space errors

2016-04-10 Thread Buntu Dev
> Looks like the exception occurred on the driver. > Consider increasing the values for the following config: > conf.set("spark.driver.memory", "10240m") > conf.set("spark.driver.maxResultSize", "2g") > Cheers > On Sat, Apr 9, 2016 a

Re: Graphframes pattern causing java heap space errors

2016-04-09 Thread Buntu Dev
> Regards, > Jacek Laskowski > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > On Sat, Apr 9, 2016 at 7:51 PM, Buntu Dev wrote: > > I'm running th

Graphframes pattern causing java heap space errors

2016-04-09 Thread Buntu Dev
I'm running this motif pattern against 1.5M vertices (5.5 MB) and 10M edges (60 MB): tgraph.find("(a)-[]->(b); (c)-[]->(b); (c)-[]->(d)") I keep running into Java heap space errors: ~ ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-33]
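A minimal GraphFrames sketch for context (only the motif string comes from the thread; the DataFrames and paths are made up). The motif joins the edge DataFrame three times, so the intermediate result can be far larger than the 60 MB of input edges:

~~~
// Sketch only; assumes GraphFrames is on the classpath.
import org.apache.spark.sql.SQLContext
import org.graphframes.GraphFrame

val sqlContext = new SQLContext(sc)
val vertices = sqlContext.read.parquet("vertices.parquet")  // needs an "id" column
val edges    = sqlContext.read.parquet("edges.parquet")     // needs "src" and "dst" columns

val tgraph  = GraphFrame(vertices, edges)
val matches = tgraph.find("(a)-[]->(b); (c)-[]->(b); (c)-[]->(d)")
matches.count()
~~~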

Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Buntu Dev
I've allocated about 4g for the driver. For the count stage, I notice the Shuffle Write to be 13.9 GB. On Sat, Apr 9, 2016 at 11:43 AM, Ndjido Ardo BAR wrote: > What's the size of your driver? > On Sat, 9 Apr 2016 at 20:33, Buntu Dev wrote: > >> Actually, df.show() wor

Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Buntu Dev
Actually, df.show() works, displaying 20 rows, but df.count() is the one causing the driver to run out of memory. There are just 3 INT columns. Any idea what could be the reason? On Sat, Apr 9, 2016 at 10:47 AM, wrote: > You seem to have a lot of columns :-) ! > df.count() displays the si
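On the original question of estimating the DataFrame's size, one rough approach (an assumption on my part, not something suggested in the thread) is to cache it and read the size off the web UI's Storage tab:

~~~
// Sketch only: df stands for the 3-INT-column DataFrame discussed above.
import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)
df.count()   // forces the cache to fill
// The web UI's "Storage" tab now reports the cached size of df.
~~~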

Re: Dataframe to parquet using hdfs or parquet block size

2016-04-07 Thread Buntu Dev
I tried setting both the HDFS and parquet block sizes, but the write to parquet did not seem to have any effect on the total number of blocks or the average block size. Here is what I did: sqlContext.setConf("dfs.blocksize", "134217728") sqlContext.setConf("parquet.block.size", "134217728") sql
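One variant worth trying (an assumption, not a confirmed fix from the thread): since these are Hadoop/Parquet output settings rather than Spark SQL properties, set them on the SparkContext's Hadoop configuration before the write:

~~~
// Sketch only; df and the output path are placeholders.
sc.hadoopConfiguration.setInt("dfs.blocksize", 134217728)
sc.hadoopConfiguration.setInt("parquet.block.size", 134217728)

df.write.parquet("output.parquet")
~~~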

Re: Rule Engine for Spark

2015-11-05 Thread Buntu Dev
You may want to read this post regarding Spark with Drools: http://blog.cloudera.com/blog/2015/11/how-to-build-a-complex-event-processing-app-on-apache-spark-and-drools/ On Wed, Nov 4, 2015 at 8:05 PM, Daniel Mahler wrote: > I am not familiar with any rule engines on Spark Streaming or even pl

Re: Algebird using spark-shell

2014-10-30 Thread Buntu Dev
Thanks. I was on Scala 2.11.1; the 2.10 build, algebird-core_2.10-0.1.11.jar, worked with spark-shell. On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell wrote: > What's the error with the 2.10 version of algebird? > On Thu, Oct 30, 2014 at 12:49 AM, thadude wrote: >> I've tried: >> . /bin/spa

Re: Error while running Streaming examples - no snappyjava in java.library.path

2014-10-20 Thread Buntu Dev
Thanks Akhil. On Mon, Oct 20, 2014 at 1:57 AM, Akhil Das wrote: > It's a known bug in JDK7 and OSX's naming convention; here's how to resolve it: > 1. Get the Snappy jar file from http://central.maven.org/maven2/org/xerial/snappy/snappy-java/ > 2. Copy the appropriate one to your project'

Re: How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-09 Thread Buntu Dev
wrote: > Your RDD does not contain pairs, since you ".map(_._2)" (BTW that can just be ".values"). "Hadoop files" means "SequenceFiles" and those store key-value pairs. That's why the method only appears for RDD[(K,V)]. > On Fri,

Re: How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-09 Thread Buntu Dev
Thanks Sean, but I'm importing org.apache.spark.streaming.StreamingContext._ Here are the spark imports: import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.kafka._ import org.apache.spark.SparkConf val stream =
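Following Sean's point above, a sketch of keeping the stream as key-value pairs so saveAsNewAPIHadoopFile is available (assuming stream is the (String, String) DStream created from KafkaUtils; the writable types, output format, and path are illustrative):

~~~
// Sketch only: keep the (key, value) shape instead of .map(_._2) so that
// PairRDDFunctions (and its saveAsNewAPIHadoopFile) applies.
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

stream.foreachRDD { (rdd, time) =>
  rdd.map { case (k, v) => (new Text(k), new Text(v)) }
     .saveAsNewAPIHadoopFile(
       s"hdfs:///tmp/stream-${time.milliseconds}",
       classOf[Text], classOf[Text],
       classOf[TextOutputFormat[Text, Text]])
}
~~~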

Re: Kafka->HDFS to store as Parquet format

2014-10-07 Thread Buntu Dev
then it may allow if the schema changes are append-only. Otherwise existing Parquet files have to be migrated to the new schema. > - Original Message - > From: "Buntu Dev" > To: "Soumitra Kumar" > Cc: u...@spark.incubator.apache.org > Sent: T

Re: Kafka->HDFS to store as Parquet format

2014-10-07 Thread Buntu Dev
Thanks for the info Soumitra, it's a good start for me. Just wanted to know how you are managing schema changes/evolution, since parquetSchema is provided to setSchema in the above sample code. On Tue, Oct 7, 2014 at 10:09 AM, Soumitra Kumar wrote: > I have used it to write Parquet files as: > va
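Not Soumitra's setSchema approach, but a rough alternative sketch with the 1.1-era Spark SQL API, where the schema lives in a case class and evolves by changing that class (stream, Event, and the paths are all assumptions):

~~~
// Sketch only: stream is assumed to be a DStream[(String, String)] from Kafka.
import org.apache.spark.sql.SQLContext

case class Event(ts: Long, key: String, value: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // lets an RDD[Event] be saved as parquet

stream.map { case (k, v) => Event(System.currentTimeMillis, k, v) }
      .foreachRDD { (rdd, time) =>
        rdd.saveAsParquetFile(s"hdfs:///events/batch-${time.milliseconds}")
      }
~~~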

Re: Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread Buntu Dev
Thanks for the update. I'm interested in writing the results to MySQL as well; can you shed some light or share a code sample on how you set up the driver/connection pool/etc.? On Thu, Sep 25, 2014 at 4:00 PM, maddenpj wrote: > Update for posterity, so once again I solved the problem shortly after po
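For reference, a generic sketch of the usual per-partition JDBC pattern (this is not maddenpj's actual setup; the URL, credentials, table, and the shape of results are all made up):

~~~
// Sketch only: results is assumed to be a DStream[(String, Long)] of (page, views).
import java.sql.DriverManager

results.foreachRDD { rdd =>
  rdd.foreachPartition { rows =>
    // one connection per partition, not per record
    val conn = DriverManager.getConnection("jdbc:mysql://dbhost:3306/stats", "user", "password")
    val stmt = conn.prepareStatement("INSERT INTO page_views (page, views) VALUES (?, ?)")
    try {
      rows.foreach { case (page, views) =>
        stmt.setString(1, page)
        stmt.setLong(2, views)
        stmt.executeUpdate()
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
~~~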

RDD partitioner or repartition examples?

2014-08-08 Thread buntu
I'm processing about 10GB of tab-delimited raw data with a few fields (page and user id, along with the timestamp when the user viewed the page) on a 40-node cluster, and using SparkSQL to compute the number of unique visitors per page at various intervals. I'm currently just reading the data as sc.textFil
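On the subject line's question, a small sketch of the two knobs (all names and partition counts are illustrative, and the column order in the TSV is assumed): repartition() for a plain RDD, and partitionBy() with a Partitioner for a pair RDD so that rows for the same page land in the same partition:

~~~
// Sketch only.
import org.apache.spark.HashPartitioner

val raw   = sc.textFile("rawdata/*")                        // ~10 GB of TSV
val views = raw.map(_.split("\t")).map(f => (f(0), f(1)))   // (page, userId); column order assumed

// Spread work across more tasks:
val spread = views.repartition(400)

// Or co-locate rows by page before per-page aggregation:
val byPage  = views.partitionBy(new HashPartitioner(400))
val uniques = byPage.distinct()                             // one row per (page, user)
                    .mapValues(_ => 1L)
                    .reduceByKey(_ + _)                     // unique visitors per page
~~~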

Spark app throwing java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-08-04 Thread buntu
I have a 40-node CDH 5.1 cluster and am attempting to run a simple Spark app that processes about 10-15GB of raw data, but I keep running into this error: java.lang.OutOfMemoryError: GC overhead limit exceeded. Each node has 8 cores and 2GB memory. I notice the heap size on the executors is set to 512
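A minimal sketch of raising the executor heap above the 512m default from inside the app (the value is a guess; whatever YARN can actually grant on a 2 GB node is the real ceiling):

~~~
// Illustrative configuration only.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("page-stats")
  .set("spark.executor.memory", "1536m")   // up from the 512m default
val sc = new SparkContext(conf)
~~~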

Re: SchemaRDD select expression

2014-07-31 Thread Buntu Dev
Thanks Michael for confirming! On Thu, Jul 31, 2014 at 2:43 PM, Michael Armbrust wrote: > The performance should be the same using the DSL or SQL strings. > > > On Thu, Jul 31, 2014 at 2:36 PM, Buntu Dev wrote: > >> I was not sure if registerAsTable() and then query again

Re: SchemaRDD select expression

2014-07-31 Thread Buntu Dev
ing in 1.0.0 using DSL only. Just curious, why don't you use the hql() / sql() methods and pass a query string in? > [1] https://github.com/apache/spark/pull/1211/files > On Thu, Jul 31, 2014 at 2:20 PM, Buntu Dev wrote: > Thanks Zongheng for the pointer. Is

Re: SchemaRDD select expression

2014-07-31 Thread Buntu Dev
select('keyword, countDistinct('userId)).groupBy('keyword) > On Thu, Jul 31, 2014 at 12:27 PM, buntu wrote: > > I'm looking to write a select statement to get a distinct count on userId grouped by keyword column on a parquet file SchemaRDD equivalent of: > >

SchemaRDD select expression

2014-07-31 Thread buntu
I'm looking to write a select statement to get a distinct count of userId grouped by the keyword column on a parquet-file SchemaRDD, equivalent to: SELECT keyword, count(distinct(userId)) from table group by keyword. How do I write it using chained select().groupBy() operations? Thanks!
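For reference, a minimal spark-shell sketch of the sql()-string route suggested later in the thread (Spark 1.0-era API; the parquet path and registered table name are placeholders):

~~~
// Sketch only: load the parquet file, register it, and run the SQL from the post.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val t = sqlContext.parquetFile("events.parquet")
t.registerAsTable("events")

val uniques = sqlContext.sql(
  "SELECT keyword, count(distinct(userId)) AS uniques FROM events GROUP BY keyword")
~~~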

Re: spark-shell -- running into ArrayIndexOutOfBoundsException

2014-07-23 Thread buntu
Turns out to be an issue with the number of fields being read: one of the fields might be missing from the raw data file, causing this error. Michael Armbrust pointed it out in another thread.

Re: Convert raw data files to Parquet format

2014-07-23 Thread buntu
That seems to be the issue: when I reduce the number of fields it works perfectly fine. Thanks again Michael, that was super helpful!

Re: Convert raw data files to Parquet format

2014-07-23 Thread buntu
Thanks Michael. If I read in multiple files and attempt to saveAsParquetFile() I get the ArrayIndexOutOfBoundsException. I don't see this if I try the same with a single file: > case class Point(dt: String, uid: String, kw: String, tz: Int, success: Int, code: String) > val point = sc.textFil

Convert raw data files to Parquet format

2014-07-23 Thread buntu
I wanted to experiment with using Parquet data with SparkSQL. I have some tab-delimited files and wanted to know how to convert them to Parquet format. I'm using the standalone spark-shell. Thanks!
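A minimal spark-shell sketch for the 1.0-era API, reusing the Point case class from the follow-up above (the field mapping and paths are illustrative; the length filter guards against the short rows that caused the ArrayIndexOutOfBoundsException in the related thread):

~~~
// Sketch only.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // lets an RDD of case classes be saved as parquet

case class Point(dt: String, uid: String, kw: String, tz: Int, success: Int, code: String)

val points = sc.textFile("rawdata/*")
  .map(_.split("\t"))
  .filter(_.length == 6)            // skip rows with missing fields
  .map(f => Point(f(0), f(1), f(2), f(3).trim.toInt, f(4).trim.toInt, f(5)))

points.saveAsParquetFile("points.parquet")
~~~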

Re: spark-shell -- running into ArrayIndexOutOfBoundsException

2014-07-23 Thread buntu
Just wanted to add more info: I was using SparkSQL, reading in the tab-delimited raw data files and converting the timestamp to Date format: sc.textFile("rawdata/*").map(_.split("\t")).map(p => Point(df.format(new Date( p(0).trim.toLong*1000L )), p(1), p(2).trim.toInt, p(3).trim.toInt, p(4).trim.toI

spark-shell -- running into ArrayIndexOutOfBoundsException

2014-07-23 Thread buntu
I'm using the spark-shell locally and working on a dataset of size 900MB. I initially ran into a "java.lang.OutOfMemoryError: GC overhead limit exceeded" error and, after some research, set SPARK_DRIVER_MEMORY to 4g. Now I run into an ArrayIndexOutOfBoundsException; please let me know if there is some way

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread buntu
If you need to run Spark apps through Hue, see if Ooyala's job server helps: http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/

Re: Apache kafka + spark + Parquet

2014-07-22 Thread buntu
> Now we are storing data directly from Kafka to Parquet. We are currently using Camus and wanted to know how you went about storing to Parquet?

Spark app vs SparkSQL app

2014-07-22 Thread buntu
I could possibly use the Spark API and write a batch app to provide some per-web-page stats such as views, uniques, etc. The same can be achieved using SparkSQL, so I wanted to check: * what are the best practices and pros/cons of either approach? * Does SparkSQL require registerAsTable for eve

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
Thanks Sean!! That's what I was looking for: group by on multiple fields. I'm gonna play with it now. Thanks again!

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
That is correct, Raffy. Assuming I convert the timestamp field to a date in the required format, is it possible to report it by date?

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
Thanks Nick. All I'm attempting is to report the number of unique visitors per page by date.

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
We have CDH 5.0.2, which doesn't include Spark SQL yet; it may only become available in CDH 5.1, which is yet to be released. If Spark SQL is the only option, then I might need to hack around to add it to the current CDH deployment, if that's possible.

Count distinct with groupBy usage

2014-07-15 Thread buntu
Hi -- New to Spark and trying to figure out how to generate unique counts per page by date given this raw data: timestamp,page,userId 1405377264,google,user1 1405378589,google,user2 1405380012,yahoo,user1 .. I can do a groupBy on a field and get the count: val lines=sc.textFile("data.csv") va
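A plain-RDD sketch of unique visitors per page per date, which avoids Spark SQL entirely (relevant since the follow-ups note it isn't in CDH 5.0.2); the date formatting and header handling are illustrative:

~~~
// Sketch only; the CSV layout follows the sample data in the post.
import java.text.SimpleDateFormat
import java.util.Date

val fmt   = new SimpleDateFormat("yyyy-MM-dd")
val lines = sc.textFile("data.csv")

val uniques = lines
  .filter(!_.startsWith("timestamp"))          // drop the header row
  .map(_.split(","))
  .map(f => ((fmt.format(new Date(f(0).trim.toLong * 1000L)), f(1)), f(2)))  // ((date, page), userId)
  .distinct()                                  // one row per (date, page, user)
  .map { case (key, _) => (key, 1L) }
  .reduceByKey(_ + _)                          // unique visitors per (date, page)
~~~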

Eclipse Spark plugin and sample Scala projects

2014-07-14 Thread buntu
Hi -- I tried searching for Eclipse Spark plugin setup for developing with Spark, and there seems to be some information I can go on, but I have not seen a starter app or project to import into Eclipse and try out. Can anyone please point me to any Scala projects to import into the Scala Eclipse ID
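In lieu of a ready-made starter project, a minimal build.sbt sketch that could serve as one (the versions are assumptions matching the 2014 timeframe); the sbteclipse plugin can then generate the Eclipse project files from it:

~~~
// build.sbt -- sketch only.
name := "spark-starter"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1"
~~~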