Re: Rule Engine for Spark

2015-11-04 Thread Daniel Mahler
I am not familiar with any rule engines on Spark Streaming or even plain Spark. Conceptually, the closest things I am aware of are Datomic and Bloom-lang. Neither of them is Spark-based, but they implement Datalog-like languages over distributed stores. - http://www.datomic.com/ - http://bloom-lan

Error using json4s with Apache Spark in spark-shell

2015-06-11 Thread Daniel Mahler
can I get the extraction to work when distributed to the workers? Wesley Miao has reproduced the problem and found that it is specific to spark-shell. He reports that this code works as a standalone application. thanks Daniel Mahler I am trying to use the case class extra
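
For reference, a minimal sketch of the kind of json4s extraction being discussed; the Event case class, its fields, and the input path are placeholders rather than details from the thread:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    // Hypothetical record type; the real field names come from the log schema.
    case class Event(id: String, value: Double)

    val inpath = "hdfs:///logs/*"          // placeholder path
    val txt = sc.textFile(inpath)
    val parsed = txt.map { line =>
      // create Formats inside the closure so it is instantiated on each worker
      implicit val formats = DefaultFormats
      parse(line).extract[Event]
    }
    parsed.count

Per the thread, the failure appears specific to case classes defined inside spark-shell; the same code compiled as a standalone application reportedly works.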

tachyon on machines launched with spark-ec2 scripts

2015-04-24 Thread Daniel Mahler
I have a cluster launched with spark-ec2. I can see a TachyonMaster process running, but I do not seem to be able to use Tachyon from the spark-shell. If I try rdd.saveAsTextFile("tachyon://localhost:19998/path") I get 15/04/24 19:18:31 INFO TaskSetManager: Starting task 12.2 in stage 1.0 (TID 2
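
A hedged guess at the shape of a fix, not a verified solution: on a multi-node cluster "localhost" resolves to each worker itself, so a tachyon:// URI would normally point at the host actually running TachyonMaster. Sketch, with a placeholder hostname:

    // Sketch only: replace <master-hostname> with the host running TachyonMaster
    // (on spark-ec2 clusters this is typically the master node).
    val out = "tachyon://<master-hostname>:19998/path"
    rdd.saveAsTextFile(out)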

Re: problem writing to s3

2015-04-23 Thread Daniel Mahler
Apr 23, 2015 at 12:11 AM, Daniel Mahler wrote: > >> Hi Akhil, >> >> It works fine when outprefix is a hdfs:///localhost/... url. >> >> It looks to me as if there is something about spark writing to the same >> s3 bucket it is reading from. >>

spark-ec2 s3a filesystem support and hadoop versions

2015-04-22 Thread Daniel Mahler
I would like to easily launch a cluster that supports s3a file systems. If I launch a cluster with `spark-ec2 --hadoop-major-version=2`, what determines the minor version of hadoop? Does it depend on the spark version being launched? Are there other allowed values for --hadoop-major-version besi

Re: problem writing to s3

2015-04-22 Thread Daniel Mahler
our worker logs and see what's happening in there? Are you > able to write the same to your HDFS? > > Thanks > Best Regards > > On Wed, Apr 22, 2015 at 4:45 AM, Daniel Mahler wrote: > >> I am having a strange problem writing to s3 that I have distilled to this >> mi

problem writing to s3

2015-04-21 Thread Daniel Mahler
I am having a strange problem writing to s3 that I have distilled to this minimal example: def jsonRaw = s"${outprefix}-json-raw" def jsonClean = s"${outprefix}-json-clean" val txt = sc.textFile(inpath)//.coalesce(shards, false) txt.count val res = txt.saveAsTextFile(jsonRaw) val txt2 = sc.text
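
The minimal example appears to be a read, a write to one S3 prefix, a read back of that output, and a second write; a sketch of that shape, with placeholder paths (the real inpath, outprefix and shards values are not in the snippet):

    val inpath    = "s3n://some-bucket/input/*"      // placeholder
    val outprefix = "s3n://some-bucket/output/run1"  // placeholder
    def jsonRaw   = s"${outprefix}-json-raw"
    def jsonClean = s"${outprefix}-json-clean"

    val txt = sc.textFile(inpath)        //.coalesce(shards, false)
    txt.count
    txt.saveAsTextFile(jsonRaw)          // first write to S3
    val txt2 = sc.textFile(jsonRaw)      // read the intermediate output back
    txt2.count
    txt2.saveAsTextFile(jsonClean)       // second write, same bucket being read and written

The follow-up messages above suggest the problem only shows up when reading from and writing to the same S3 bucket; with an hdfs:// outprefix it reportedly works.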

HiveContext vs SQLContext

2015-04-20 Thread Daniel Mahler
Is HiveContext still preferred over SQLContext? What are the current (1.3.1) differences between them? thanks Daniel
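
For context: in 1.3.x HiveContext is a subclass of SQLContext, so anything the latter does the former does too; the usual trade-off is that HiveContext pulls in the Hive dependencies in exchange for the more complete HiveQL dialect and Hive UDF support. A minimal sketch of constructing both:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext  = new SQLContext(sc)   // no Hive dependency, basic SQL dialect
    val hiveContext = new HiveContext(sc)  // superset: HiveQL, Hive UDFs, metastore access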

Re: Problem getting program to run on 15TB input

2015-04-13 Thread Daniel Mahler
Sometimes a large number of partitions leads to memory problems. Something like val rdd1 = sc.textFile(file1).coalesce(500). ... val rdd2 = sc.textFile(file2).coalesce(500). ... may help. On Mon, Mar 2, 2015 at 6:26 PM, Arun Luthra wrote: > Everything works smoothly if I do the 99%-removal fi
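
Expanded slightly: coalesce without a shuffle just merges existing partitions, which is cheap; passing shuffle = true (or calling repartition) also rebalances the data at the cost of a shuffle. The file paths below are placeholders:

    val file1 = "hdfs:///data/part1/*"   // placeholder paths
    val file2 = "hdfs:///data/part2/*"

    val rdd1 = sc.textFile(file1).coalesce(500)         // merge partitions, no shuffle
    val rdd2 = sc.textFile(file2).coalesce(500, true)   // shuffle = true also rebalances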

Cleaning/transforming json befor converting to SchemaRDD

2014-11-03 Thread Daniel Mahler
I am trying to convert terabytes of json log files into parquet files, but I need to clean them a little first. I end up doing the following: val txt = sc.textFile(inpath).coalesce(800) val json = (for { line <- txt JObject(child) = parse(line) child2 = (for { JFiel
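
One way such cleaning is commonly written (a sketch under assumptions, not the code from the thread): parse each line with json4s, keep only the wanted fields, re-serialize to a JSON string, and let Spark SQL infer the schema from the cleaned lines before writing Parquet. The field whitelist and paths are placeholders:

    import org.apache.spark.sql.SQLContext
    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    val sqlContext = new SQLContext(sc)
    val inpath  = "s3n://bucket/logs/*"      // placeholder
    val outpath = "s3n://bucket/parquet"     // placeholder
    val keep = Set("ts", "user", "msg")      // hypothetical field whitelist

    val txt = sc.textFile(inpath).coalesce(800)
    val cleaned = txt.map { line =>
      val JObject(fields) = parse(line)
      val kept = fields.filter { case JField(name, _) => keep(name) }
      compact(render(JObject(kept)))         // back to one JSON string per line
    }
    val schemaRdd = sqlContext.jsonRDD(cleaned)   // schema inference on the cleaned records
    schemaRdd.saveAsParquetFile(outpath)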

Re: union of SchemaRDDs

2014-11-01 Thread Daniel Mahler
the > schema on the results. > > Matei > > > On Nov 1, 2014, at 3:57 PM, Daniel Mahler wrote: > > > > I would like to combine 2 parquet tables I have create. > > I tried: > > > > sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fil

union of SchemaRDDs

2014-11-01 Thread Daniel Mahler
I would like to combine 2 parquet tables I have created. I tried: sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fileB")) but that just returns RDD[Row]. How do I combine them to get a SchemaRDD[Row]? thanks Daniel
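
Matei's reply above is cut off; the usual answer from the SchemaRDD era is to call unionAll on the SchemaRDD itself rather than going through sc.union, since unionAll preserves the schema:

    // Sketch, using the same sqx SQLContext as in the question.
    val a = sqx.parquetFile("fileA")
    val b = sqx.parquetFile("fileB")
    val both = a.unionAll(b)   // still a SchemaRDD, schema preserved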

Re: use additional ebs volumes for hdfs storage with spark-ec2

2014-10-30 Thread Daniel Mahler
hadoop/ where you can see data node dir property > which will be a comma-separated list of volumes. > > Thanks > Best Regards > > On Thu, Oct 30, 2014 at 5:21 AM, Daniel Mahler wrote: > >> I started my ec2 spark cluster with >> >> ./ec2/spark-ec2 --ebs-vol

use additional ebs volumes for hdfs storage with spark-ec2

2014-10-29 Thread Daniel Mahler
I started my ec2 spark cluster with ./ec2/spark-ec2 --ebs-vol-{size=100,num=8,type=gp2} -t m3.xlarge -s 10 launch mycluster I see the additional volumes attached but they do not seem to be set up for hdfs. How can I check if they are being utilized on all workers, and how can I get all workers to

Fwd: Saving very large data sets as Parquet on S3

2014-10-24 Thread Daniel Mahler
I am trying to convert some json logs to Parquet and save them on S3. In principle this is just import org.apache.spark._ val sqlContext = new sql.SQLContext(sc) val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8) data.registerAsTable("data") data.saveAsParquetFile("s3n://target/path") Th
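
Written out with comments (same code, the source and target bucket paths remain the original placeholders); the second argument to jsonFile is the sampling ratio used for schema inference, so 10e-8 asks Spark SQL to look at only a tiny fraction of the records when inferring the schema:

    import org.apache.spark._
    val sqlContext = new sql.SQLContext(sc)

    // sampling ratio 10e-8: infer the schema from a very small sample of the data
    val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8)
    data.registerAsTable("data")
    data.saveAsParquetFile("s3n://target/path")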

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
partition. > > It might be a nice user hint if Spark warned when parallelism is disabled > by the input format. > > Nick > > On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler wrote: > >> Hi Nicholas, >> >> Gzipping is an impressive guess! Yes, they are. >

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
d and just go > with defaults all around? > > On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler wrote: > >> I launch the cluster using vanilla spark-ec2 scripts. >> I just specify the number of slaves and instance type >> >> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Ma

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I launch the cluster using vanilla spark-ec2 scripts. I just specify the number of slaves and instance type On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler wrote: > I usually run interactively from the spark-shell. > My data definitely has more than enough partitions to keep all the w

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
> you are seeing for it? And can you report on how many partitions your RDDs > have? > > On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler wrote: > >> >> I am launching EC2 clusters using the spark-ec2 scripts. >> My understanding is that this configures spark to use the

Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I am launching EC2 clusters using the spark-ec2 scripts. My understanding is that this configures spark to use the available resources. I can see that spark will use the available memory on larger instance types. However I have never seen spark running at more than 400% (using 100% on 4 cores) on ma
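
The exchange above points at gzipped input as the likely cause: a .gz file is not splittable, so each file becomes a single partition and parallelism is capped at the number of files regardless of cores. A hedged sketch of the usual workaround, repartitioning after the read (the path and multiplier are placeholders):

    // Few .gz files => few partitions, however many cores are available.
    val raw = sc.textFile("s3n://bucket/logs/*.gz")          // placeholder path
    val spread = raw.repartition(sc.defaultParallelism * 3)  // spread work across all cores
    spread.count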

Saving very large data sets as Parquet on S3

2014-10-20 Thread Daniel Mahler
I am trying to convert some json logs to Parquet and save them on S3. In principle this is just import org.apache.spark._ val sqlContext = new sql.SQLContext(sc) val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8) data.registerAsTable("data") data.saveAsParquetFile("s3n://target/path") Th

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
On Mon, May 19, 2014 at 2:04 AM, Daniel Mahler wrote: > I agree that for updating rsync is probably preferable, and it seems like > for that purpose it would also parallelize well, since most of the time is > spent computing checksums, so the process is not constrained by the total >

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
w.mosharaf.com/ >>> >>> >>> On Sun, May 18, 2014 at 11:07 PM, Andrew Ash wrote: >>> >>>> My first thought would be to use libtorrent for this setup, and it >>>> turns out that both Twitter and Facebook do code deploys with a bittorrent

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
, and it turns >>> out that both Twitter and Facebook do code deploys with a bittorrent setup. >>> Twitter even released their code as open source: >>> >>> >>> https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent >>>

Re: sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
but we > do want to minimize the complexity of our standard ec2 launch scripts to > reduce the chance of something breaking. > > > On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler wrote: > >> I am launching a rather large cluster on ec2. >> It seems like the launch i

sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
I am launching a rather large cluster on ec2. It seems like the launch is taking forever on: Setting up spark RSYNC'ing /root/spark to slaves... ... It seems that bittorrent might be a faster way to replicate the sizeable spark directory to the slaves, particularly if there is a lot of not very

making spark/conf/spark-defaults.conf changes take effect

2014-05-18 Thread Daniel Mahler
I am running on an AWS EC2 cluster that I launched using the spark-ec2 script that comes with spark, and I use the "-v master" option to run the head version. If I then log into the master and make changes to spark/conf/spark-defaults.conf, how do I make the changes take effect across the cluster? Is j

Re: Configuring Spark for reduceByKey on massive data sets

2014-05-18 Thread Daniel Mahler
Hi Matei, Thanks for the suggestions. Is the number of partitions set by calling 'myrrd.partitionBy(new HashPartitioner(N))'? Is there some heuristic formula for choosing a good number of partitions? thanks Daniel On Sat, May 17, 2014 at 8:33 PM, Matei Zaharia wrote: > Make sure you set up e
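
For reference, a sketch of the partitionBy call being asked about; myrrd stands for a pair RDD from the surrounding job, and the multiplier reflects only the commonly cited rule of thumb of a few partitions per core, not a recommendation from this thread:

    import org.apache.spark.HashPartitioner

    // Rough heuristic: 2-4 partitions per available core.
    val N = sc.defaultParallelism * 3
    val partitioned = myrrd.partitionBy(new HashPartitioner(N))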

Configuring Spark for reduceByKey on massive data sets

2014-05-17 Thread Daniel Mahler
I have had a lot of success with Spark on large datasets, both in terms of performance and flexibility. However I hit a wall with reduceByKey when the RDD contains billions of items. I am reducing with simple functions like addition for building histograms, so the reduction process should be consta
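
A minimal sketch of the histogram-style reduction being described, under assumptions: records and keyOf are hypothetical stand-ins for the real data and key function, and the explicit second argument to reduceByKey sets the number of reduce-side partitions independently of the input:

    // `records` and `keyOf` are placeholders; 2048 is an arbitrary example value.
    val counts = records
      .map(r => (keyOf(r), 1L))
      .reduceByKey(_ + _, 2048)   // explicit numPartitions for the reduce side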