I am not familiar with any rule engines on Spark Streaming, or even on plain
Spark.
The conceptually closest things I am aware of are Datomic and Bloom-lang.
Neither of them is Spark-based, but both implement Datalog-like languages
over distributed stores.
- http://www.datomic.com/
- http://bloom-lan
Can I get the extraction to work when distributed to the workers?
Wesley Miao has reproduced the problem and found that it is specific to
spark-shell. He reports that this code works as a standalone application.
thanks
Daniel Mahler
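Wesley's finding matches a known spark-shell quirk: classes defined in the REPL are wrapped in the interpreter's synthetic outer object, which can break serialization when tasks are shipped to workers. A minimal sketch of the standalone-app shape that avoids this (the `LogLine` case class and sample input are hypothetical, not from the thread):

```scala
// Defining the case class at the top level of a standalone app
// (not inside the REPL) gives it a stable, serializable identity
// on the workers.
case class LogLine(level: String, msg: String)

// A plain extraction function; in the real job this would be
// passed to rdd.map / rdd.collect { ... } on the cluster.
def extract(raw: String): Option[LogLine] = raw.split(":", 2) match {
  case Array(level, msg) => Some(LogLine(level.trim, msg.trim))
  case _                 => None
}

val parsed = extract("WARN: disk nearly full")
```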
I am trying to use the case class extra
I have a cluster launched with spark-ec2.
I can see a TachyonMaster process running,
but I do not seem to be able to use tachyon from the spark-shell.
if I try
rdd.saveAsTextFile("tachyon://localhost:19998/path")
I get
15/04/24 19:18:31 INFO TaskSetManager: Starting task 12.2 in stage 1.0 (TID
2
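One thing worth ruling out on a cluster: the workers resolve `localhost` to themselves, so `tachyon://localhost:19998/...` cannot reach the TachyonMaster on the driver node. A small sketch of building the URI from the master's hostname (the helper and hostname are hypothetical, not part of the original code):

```scala
// Point the URI at the host actually running TachyonMaster,
// not at localhost, so every worker resolves the same master.
def tachyonUri(masterHost: String, path: String): String =
  s"tachyon://$masterHost:19998$path"

val uri = tachyonUri("ec2-54-0-0-1.compute-1.amazonaws.com", "/path")
// On the cluster this would then be: rdd.saveAsTextFile(uri)
```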
Apr 23, 2015 at 12:11 AM, Daniel Mahler wrote:
>
>> Hi Akhil,
>>
>> It works fine when outprefix is an hdfs:///localhost/... URL.
>>
>> It looks to me as if something goes wrong when Spark writes to the same
>> s3 bucket it is reading from.
>>
I would like to easily launch a cluster that supports s3a file systems.
if I launch a cluster with `spark-ec2 --hadoop-major-version=2`,
what determines the minor version of hadoop?
Does it depend on the spark version being launched?
Are there other allowed values for --hadoop-major-version besi
our worker logs and see what's happening in there? Are you
> able to write the same to your HDFS?
>
> Thanks
> Best Regards
>
> On Wed, Apr 22, 2015 at 4:45 AM, Daniel Mahler wrote:
>
>> I am having a strange problem writing to s3 that I have distilled to this
>> mi
I am having a strange problem writing to s3 that I have distilled to this
minimal example:
def jsonRaw = s"${outprefix}-json-raw"
def jsonClean = s"${outprefix}-json-clean"
val txt = sc.textFile(inpath)//.coalesce(shards, false)
txt.count
val res = txt.saveAsTextFile(jsonRaw)
val txt2 = sc.text
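For reference, the path helpers in the snippet above just interpolate `outprefix` into output locations. A self-contained sketch of that part (with a stand-in `outprefix`, since the real one is an S3 prefix):

```scala
// Stand-in prefix; in the real job this is an s3n:// location.
val outprefix = "out"

// Same shape as the defs in the minimal example: derive the two
// output paths from a single prefix.
def jsonRaw   = s"${outprefix}-json-raw"
def jsonClean = s"${outprefix}-json-clean"
// On the cluster: sc.textFile(inpath).saveAsTextFile(jsonRaw), etc.
```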
Is HiveContext still preferred over SQLContext?
What are the current (1.3.1) differences between them?
thanks
Daniel
Sometimes a large number of partitions leads to memory problems.
Something like
val rdd1 = sc.textFile(file1).coalesce(500). ...
val rdd2 = sc.textFile(file2).coalesce(500). ...
may help.
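A common rule of thumb behind picking a number like 500 is to aim for on the order of 100–200 MB of input per partition. The sketch below encodes that heuristic; the target size is an assumption for illustration, not something from the thread:

```scala
// Rough heuristic: one partition per ~128 MB of input, with a
// floor of 1. Tune perPartitionBytes for your cluster and format.
def targetPartitions(totalBytes: Long,
                     perPartitionBytes: Long = 128L * 1024 * 1024): Int =
  math.max(1L, (totalBytes + perPartitionBytes - 1) / perPartitionBytes).toInt

val n = targetPartitions(64L * 1024 * 1024 * 1024) // 64 GB of input
// On the cluster: sc.textFile(file1).coalesce(n)
```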
On Mon, Mar 2, 2015 at 6:26 PM, Arun Luthra wrote:
> Everything works smoothly if I do the 99%-removal fi
I am trying to convert terabytes of JSON log files into Parquet files,
but I need to clean them a little first.
I end up doing the following
val txt = sc.textFile(inpath).coalesce(800)
val json = (for {
line <- txt
JObject(child) = parse(line)
child2 = (for {
JFiel
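The cleaning step in the for-comprehension above essentially filters and renames the fields of each parsed object. A self-contained sketch of that step, using plain key–value pairs in place of json4s `JField`s (the bad-key rule here is a made-up stand-in):

```scala
// Stand-in for the json4s cleanup: drop fields whose keys would
// make bad Parquet column names, and normalize the rest.
def cleanFields(fields: List[(String, String)]): List[(String, String)] =
  for {
    (key, value) <- fields
    if key.nonEmpty && !key.contains(" ")
  } yield (key.toLowerCase, value)

val cleaned = cleanFields(List("TS" -> "2014", "bad key" -> "x", "Msg" -> "ok"))
```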
the
> schema on the results.
>
> Matei
>
> > On Nov 1, 2014, at 3:57 PM, Daniel Mahler wrote:
> >
> > I would like to combine 2 parquet tables I have created.
> > I tried:
> >
> > sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fil
I would like to combine 2 parquet tables I have created.
I tried:
sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fileB"))
but that just returns RDD[Row].
How do I combine them and get a SchemaRDD back?
thanks
Daniel
hadoop/ where you can see data node dir property
> which will be a comma separated list of volumes.
>
> Thanks
> Best Regards
>
> On Thu, Oct 30, 2014 at 5:21 AM, Daniel Mahler wrote:
>
>> I started my ec2 spark cluster with
>>
>> ./ec2/spark---ebs-vol
I started my ec2 spark cluster with
./ec2/spark-ec2 --ebs-vol-{size=100,num=8,type=gp2} -t m3.xlarge -s 10
launch mycluster
I see the additional volumes attached but they do not seem to be set up for
hdfs.
How can I check if they are being utilized on all workers,
and how can I get all workers to
I am trying to convert some json logs to Parquet and save them on S3.
In principle this is just
import org.apache.spark._
val sqlContext = new sql.SQLContext(sc)
val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8)
data.registerAsTable("data")
data.saveAsParquetFile("s3n://target/path")
Th
partition.
>
> It might be a nice user hint if Spark warned when parallelism is disabled
> by the input format.
>
> Nick
>
>
> On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler wrote:
>
>> Hi Nicholas,
>>
>> Gzipping is an impressive guess! Yes, they are.
d and just go
> with defaults all around?
>
> On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler wrote:
>
>> I launch the cluster using vanilla spark-ec2 scripts.
>> I just specify the number of slaves and instance type
>>
>> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Ma
I launch the cluster using vanilla spark-ec2 scripts.
I just specify the number of slaves and instance type
On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler wrote:
> I usually run interactively from the spark-shell.
> My data definitely has more than enough partitions to keep all the w
> you are seeing for it? And can you report on how many partitions your RDDs
> have?
>
> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler wrote:
>
>>
>> I am launching EC2 clusters using the spark-ec2 scripts.
>> My understanding is that this configures spark to use the
I am launching EC2 clusters using the spark-ec2 scripts.
My understanding is that this configures spark to use the available
resources.
I can see that spark will use the available memory on larger instance types.
However I have never seen spark running at more than 400% (using 100% on 4
cores)
on ma
On Mon, May 19, 2014 at 2:04 AM, Daniel Mahler wrote:
> I agree that for updating rsync is probably preferable, and it seems like
> for that purpose it would also parallelize well since most of the time is
> spent computing checksums so the process is not constrained by the total
>
w.mosharaf.com/
>>>
>>>
>>> On Sun, May 18, 2014 at 11:07 PM, Andrew Ash wrote:
>>>
>>>> My first thought would be to use libtorrent for this setup, and it
>>>> turns out that both Twitter and Facebook do code deploys with a bittorrent
, and it turns
>>> out that both Twitter and Facebook do code deploys with a bittorrent setup.
>>> Twitter even released their code as open source:
>>>
>>>
>>> https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent
>>>
but we
> do want to minimize the complexity of our standard ec2 launch scripts to
> reduce the chance of something breaking.
>
>
> On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler wrote:
>
>> I am launching a rather large cluster on ec2.
>> It seems like the launch i
I am launching a rather large cluster on ec2.
It seems like the launch is taking forever on
Setting up spark
RSYNC'ing /root/spark to slaves...
...
It seems that bittorrent might be a faster way to replicate
the sizeable spark directory to the slaves,
particularly if there is a lot of not very
I am running in an AWS EC2 cluster that I launched using the spark-ec2
script that comes with spark,
and I use the "-v master" option to run the head version.
If I then log into the master and make changes to spark/conf/spark-defaults.conf,
how do I make the changes take effect across the cluster?
Is j
Hi Matei,
Thanks for the suggestions.
Is the number of partitions set by calling 'myrrd.partitionBy(new
HashPartitioner(N))'?
Is there some heuristic formula for choosing a good number of partitions?
thanks
Daniel
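On the heuristic question: a widely quoted starting point is 2–4 tasks per CPU core across the cluster, so stragglers never leave cores idle. A sketch of that rule (the multiplier is a rule of thumb, not an official formula):

```scala
// Start from total cores and oversubscribe a little; 3x per core
// is a middle-of-the-road pick between 2x and 4x.
def goodPartitionCount(totalCores: Int, tasksPerCore: Int = 3): Int =
  totalCores * tasksPerCore

val n = goodPartitionCount(40) // e.g. 10 workers x 4 cores each
// Then: myrdd.partitionBy(new HashPartitioner(n))
```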
On Sat, May 17, 2014 at 8:33 PM, Matei Zaharia wrote:
> Make sure you set up e
I have had a lot of success with Spark on large datasets,
both in terms of performance and flexibility.
However I hit a wall with reduceByKey when the RDD contains billions of
items.
I am reducing with simple functions like addition for building histograms,
so the reduction process should be consta
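For histogram building the combine function itself is cheap and associative; a sketch of the kind of constant-per-key merge that reduceByKey applies, in pure Scala independent of Spark:

```scala
// Merge two partial histograms by adding counts; reduceByKey
// applies a function like this pairwise on each worker (map-side
// combine) before shuffling, keeping per-key state constant.
def mergeHist(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
  b.foldLeft(a) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0L) + v)
  }

val h = mergeHist(Map("GET" -> 2L), Map("GET" -> 3L, "PUT" -> 1L))
```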