Put the many small files into Hadoop Archives (HAR) to improve the performance of
reading small files. Alternatively, have a batch job concatenate them.
> On 11 Feb 2016, at 18:33, Junjie Qian wrote:
>
> Hi all,
>
> I am working with Spark 1.6, scala and have a big dataset
Can you provide more details? Your use case does not sound like you need Spark.
In any case, your version is too old: it does not make sense to develop now with
1.2.1, and there is no "project limitation" that can justify this.
> On 08 Feb 2016, at 06:48, Meetu Maltiar wrote:
This could be done through ODBC. Keep in mind that you can run SAS jobs
directly on a Hadoop cluster using the SAS embedded process engine, or dump some
data to a SAS LASR cluster, but you had better ask SAS about this.
> On 03 Feb 2016, at 18:43, Sourav Mazumder wrote:
>
It depends of course on the background of the people, but how about some
examples ("word count") of how it works behind the scenes?
> On 01 Feb 2016, at 07:31, charles li wrote:
>
>
> Apache Spark™ is a fast and general engine for large-scale data processing.
>
> it's a
Spark has its best use cases in in-memory batch processing / machine learning.
Connecting multiple different sources/destinations requires some thinking and
probably more than Spark.
Connecting Spark to a database makes sense only in very few cases. You will
have huge performance issues due to
What is your data size, the algorithm and the expected time?
Depending on this, the group can recommend optimizations or tell you that
the expectations are wrong.
> On 20 Jan 2016, at 18:24, Pavel Plotnikov
> wrote:
>
> Thanks, Akhil! It helps, but this jobs
Ignite can also cache RDDs.
> On 12 Jan 2016, at 13:06, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
>
> Jorn, you said Ignite or ... ? What was the second choice you were thinking
> of? It seems that got omitted.
>
>> On Jan 12, 2016, at 2:44 AM, Jörn Franke
You can look at Ignite as an HDFS cache or for storing RDDs.
> On 11 Jan 2016, at 21:14, Dmitry Goldenberg wrote:
>
> We have a bunch of Spark jobs deployed and a few large resource files such as
> e.g. a dictionary for lookups or a statistical model.
>
> Right now,
You can mark the table as transactional, and then you can do single updates.
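Roughly like this (untested sketch; the table and column names are assumptions, and Hive ACID additionally needs the transaction manager configured):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc) // sc: an existing SparkContext

    // ACID tables must be bucketed and stored as ORC.
    hiveContext.sql(
      """CREATE TABLE customer (id INT, name STRING)
        |CLUSTERED BY (id) INTO 8 BUCKETS
        |STORED AS ORC
        |TBLPROPERTIES ('transactional' = 'true')""".stripMargin)

    // Single-row updates then run in Hive itself (Spark 1.6 cannot execute
    // ACID UPDATE statements), e.g. from beeline:
    //   UPDATE customer SET name = 'new name' WHERE id = 42;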
> On 07 Jan 2016, at 08:10, sudhir wrote:
>
> Hi,
>
> I have a hive table of 20Lakh records and to update a row I have to load the
> entire table in dataframe and process that and then Save it
You need the table in an efficient format, such as ORC or Parquet. Have the
table sorted appropriately (hint: by the most discriminating column in the
WHERE clause). Do not use SAN or virtualization for the slave nodes.
Can you please post your query?
I always recommend avoiding single updates where
Flume could be interesting for you.
> On 19 Dec 2015, at 00:27, SRK wrote:
>
> Hi,
>
> How to run multiple Spark jobs that takes Spark Streaming data as the
> input as a workflow in Oozie? We have to run our Streaming job first and
> then have a workflow of Spark
Can you elaborate a little bit more on the use case? It looks a little bit like
an abuse of Spark in general. Interactive queries that are not suitable for
in-memory batch processing might be better supported by Ignite, which has
in-memory indexes, a concept of hot, warm and cold data etc., or by Hive on
You may want to use a Bloom filter for this, but make sure that you understand
how it works.
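Something like this, assuming Guava is available on the classpath (untested sketch; the keys, sizing and error rate are assumptions):

    import java.nio.charset.Charset
    import com.google.common.hash.{BloomFilter, Funnels}

    // previousKeys: stand-in for keys already processed in earlier runs;
    // expected insertions (10M) and false-positive rate (1%) are assumptions.
    val previousKeys: Seq[String] = Seq("key-1", "key-2")
    val bloom = BloomFilter.create(
      Funnels.stringFunnel(Charset.forName("UTF-8")), 10000000, 0.01)
    previousKeys.foreach(k => bloom.put(k))

    // records: RDD[String] of this batch's keys. A Bloom filter can yield
    // false positives, so "seen" hits may need an exact re-check.
    val seen = sc.broadcast(bloom)
    val fresh = records.filter(k => !seen.value.mightContain(k))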
> On 08 Dec 2015, at 09:44, Ramkumar V wrote:
>
> Im running spark batch job in cluster mode every hour and it runs for 15
> minutes. I have certain unique keys in the dataset.
This would require a special HDFS log4j appender. Alternatively, try the Flume
log4j appender.
> On 08 Dec 2015, at 13:00, sunil m <260885smanik...@gmail.com> wrote:
>
> Hi!
> I configured log4j.properties file in conf folder of spark with following
> values...
>
>
I am not sure about your use case. How should a human interpret many terabytes
of data in one large visualization? You have to be more specific: what part of
the data needs to be visualized, what kind of visualization, what navigation do
you expect within the visualization, how many users,
Can you elaborate more on the use case?
> On 01 Dec 2015, at 20:51, Andrés Ivaldi wrote:
>
> Hi,
>
> I'd like to use spark to perform some transformations over data stored in SQL,
> but I need low Latency, I'm doing some test and I run into spark context
> creation and
> as Measure and Product, Product Family as Dimension
>
> Only 3 columns, it takes like 20s to perform that query and the aggregation,
> the query directly to the database with a grouping at the columns takes like
> 1s
>
> regards
>
>
>
>> On Tue, Dec 1, 2
I would not use MongoDB because it does not fit well into the Spark or Hadoop
architecture. You can use it if your data volume is very small and already
preaggregated, but this is a very limited use case. You can use HBase, or, with
future versions of Hive (if they use Tez > 0.8), for interactive
Hi,
I recommend using the latest version of Hive. You may also wait for Hive on
Tez with Tez version >= 0.8 and Hive > 1.2. Before that, I recommend first
trying other optimizations of Hive and having a look at the storage format
together with storage indexes (not the regular ones) and Bloom
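For illustration, an ORC table with storage indexes and Bloom filters could look like this (untested sketch, HiveQL shown via a HiveContext; names are assumptions):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc) // sc: an existing SparkContext

    // Storage indexes and Bloom filters are per-table ORC properties.
    hiveContext.sql(
      """CREATE TABLE events (customer_id STRING, ts BIGINT, payload STRING)
        |STORED AS ORC
        |TBLPROPERTIES (
        |  'orc.compress' = 'SNAPPY',
        |  'orc.create.index' = 'true',
        |  'orc.bloom.filter.columns' = 'customer_id'
        |)""".stripMargin)

    // Insert sorted on the filter column so min/max indexes can skip stripes:
    hiveContext.sql(
      "INSERT INTO TABLE events SELECT * FROM staging_events SORT BY customer_id")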
I would have to check the Spark source code, but theoretically you can limit
the number of threads at the JVM level. Maybe Spark does this. Alternatively,
you can use cgroups, but this introduces other complexity.
> On 10 Nov 2015, at 14:33, Peter Rudenko wrote:
>
> Hi i
Maybe look at WebSockets/STOMP to get it to the end user? Or HTTP/2 plus STOMP
in the future.
> On 10 Nov 2015, at 21:28, Andy Davidson wrote:
>
> I just finished watching a great presentation from a recent spark summit on
> real time movie recommendations using
Is there any distributor supporting these software components in combination?
If not, and your core business is not software, then you may want to look for
something else, because it might not make sense to build up internal know-how
in all of these areas.
In any case - it all depends highly on
Do you use some compression? Maybe some is activated by default in your
Hadoop environment?
> On 06 Nov 2015, at 00:34, rok wrote:
>
> Apologies if this appears a second time!
>
> I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a
>
You can check a script that I created for the Amazon cloud:
https://snippetessay.wordpress.com/2015/04/18/big-data-lab-in-the-cloud-with-hadoopsparkrpython/
If I remember correctly, you need to add something to the startup .py for
IPython.
> On 03 Nov 2015, at 01:04, Andy Davidson
Try with max date; in your case it could make more sense to represent the date
as an int.
> On 01 Nov 2015, at 21:03, Koert Kuipers wrote:
>
> hello all,
> i am trying to get familiar with spark sql partitioning support.
>
> my data is partitioned by date,
Maybe Hortonworks support can help you much better.
Otherwise you may want to change the YARN scheduler configuration and
preemption. Do you use something like speculative execution?
How do you start execution of the programs? Maybe you are already using all
cores of the master...
> On 30 Oct
What storage format?
> On 30 Oct 2015, at 12:05, Rex Xiong wrote:
>
> Hi folks,
>
> I have a Hive external table with partitions.
> Every day, an App will generate a new partition day=yyyy-MM-dd stored by
> parquet and run add-partition Hive command.
> In some cases, we
Maybe SparkR? What languages do your users speak?
> On 26 Oct 2015, at 23:12, danilo wrote:
>
> Hi All, I want to create a monitoring tool using my sensor data. I receive
> the events every seconds and I need to create a report using node.js. Right
> now I created my kpi
Good formats are Parquet or ORC. Both can be combined with compression, such as
Snappy. They are much faster than JSON. However, the table structure is up to
you and depends on your use case.
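For example, converting the JSON once into Snappy-compressed Parquet (untested sketch; the paths are assumptions):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

    // Read the raw JSON once and persist it as columnar Parquet;
    // later queries then skip the JSON parsing entirely.
    val df = sqlContext.read.json("hdfs:///raw/json/")
    df.write.parquet("hdfs:///tables/events_parquet/")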
> On 17 Oct 2015, at 23:07, Gavin Yue wrote:
>
> I have json files which
I am not aware of any empirical evidence, but I think Hadoop (HDFS) as a
datastore for Spark is quite common. With relational databases you usually
do not have so much data, and you do not benefit from data locality.
On Fri, 9 Oct 2015 at 15:16, Rahul Jeevanandam wrote:
>
> Thanks a lot!
> Nicolas
>
>
> ----- Original Message -----
> From: "Jörn Franke" <jornfra...@gmail.com>
> To: nib...@free.fr, "Brett Antonides" <banto...@gmail.com>
> Cc: user@spark.apache.org
> Sent: Saturday, 3 October 2015 11:17:51
> Subject: Re: H
> Suppose the records are still updatable.
>
> Tks to confirm if it can be solution for my use case. Or any other idea..
>
> Thanks a lot !
> Nicolas
>
>
> ----- Original Message -----
> From: "Jörn Franke" <jornfra...@gmail.com>
> To: nib...@free.fr, "Brett Anto
>
> Nicolas
>
>
>
>
>
> Jörn Franke <jornfra...@gmail.com> wrote:
>
> If you use transactional tables in hive together with insert, update,
> delete then it does the "concatenate " for you a
>
> On Fri, Oct 2, 2015 at 3:48 PM, < nib...@free.fr > wrote:
>
>
> Hello,
> Yes but :
> - In the Java API I don't find a API to create a HDFS archive
> - As soon as I receive a message (with messageID) I need to replace the
> old existing file by the new one (na
. 2015 at 16:48, <nib...@free.fr> wrote:
> Thanks a lot, why you said "the most recent version" ?
>
> ----- Original Message -----
> From: "Jörn Franke" <jornfra...@gmail.com>
> To: "nibiau" <nib...@free.fr>
> Cc: banto...@gmail.com, user
Use a Hadoop archive (HAR).
On Sun, 27 Sep 2015 at 15:36, wrote:
> Hello,
> I'm still investigating my small file generation problem generated by my
> Spark Streaming jobs.
> Indeed, my Spark Streaming jobs are receiving a lot of small events (avg
> 10kb), and I have to store them
Once the data is consolidated in Oracle, it serves as the source
> of truth for external users.
>
> Regards,
> Sri Eswari.
>
> On Mon, Sep 21, 2015 at 10:55 PM, Jörn Franke <jornfra...@gmail.com>
> wrote:
>
>> You do not need Hadoop. However, you should think abou
You do not need Hadoop. However, you should think about using it. If you
use Spark to load data directly from Oracle, then your database might face
unexpected load spikes once a Spark node fails. Additionally, the
Oracle database, if it is not based on local disk, may have a storage
l storage lead to high latency in my app.
>
> 3/ How to get real-time statistics from Spark,
> In most of the Spark streaming examples, the statistics are echo to the
> stdout.
> However, I want to display those statics on GUI, is there any way to
> retrieve data from Spark directl
If you want to be able to let your users query their portfolios, then you may
want to think about storing the current state of the portfolios in
HBase/Phoenix; alternatively, a cluster of relational databases can make
sense. For the rest you may use Spark.
On Sat, 19 Sep 2015 at 4:43, Thúy Hằng Lê
Why did you not stay with the batch approach? To me the architecture looks
very complex for the simple thing you want to achieve. Why don't you process
the data already in Storm?
On Tue, 15 Sep 2015 at 6:20, srungarapu vamsi wrote:
> I am pretty new to spark.
I fear you have to do the plumbing all yourself. This is the same for all
commercial and non-commercial libraries/analytics packages. It often also
depends on the functional requirements on how you distribute.
On Sat, 12 Sep 2015 at 20:18, Rex X wrote:
> Hi everyone,
>
>
I am not sure what you are trying to achieve here. Have you thought about
using Flume? Additionally, maybe something like rsync?
On Sat, 12 Sep 2015 at 0:02, Varadhan, Jawahar wrote:
> Hi all,
>I have a coded a custom receiver which receives kafka messages.
What do you mean by import? All ways have advantages and disadvantages. You
may first think about when you can make large extractions of data from the
database into Spark. You may also think about whether the database should be
the persistent storage of the data or if you need something aside of the
Can you use a map or a list with different properties as one parameter?
Alternatively, a string where the parameters are comma-separated...
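The single-array-argument variant could look like this (untested sketch; df and the column names are assumptions):

    import org.apache.spark.sql.functions.{array, col, udf}

    // One UDF taking a single array column emulates varargs; nulls are dropped.
    val concatAll = udf((xs: Seq[String]) => xs.filter(_ != null).mkString("|"))

    // The call site decides how many columns to pack into the array argument.
    val out = df.withColumn("combined",
      concatAll(array(col("a"), col("b"), col("c"))))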
On Mon, 7 Sep 2015 at 8:35, Night Wolf wrote:
> Is it possible to have a UDF which takes a variable number of arguments?
>
> e.g.
You can always use the ML libs in R, but you have to integrate them into
SparkR (i.e. write all the logic to run in parallel etc.). However, for your use
case it may make more sense to write the R wrapper for MLlib yourself, if the
project cannot provide it in time. It is not that difficult to call Java or
Maybe you can tell us more about your use case; I somehow have the feeling
that we are missing something here.
On Thu, 3 Sep 2015 at 15:54, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Store them as a Hadoop archive (HAR)
>
> On Wed, 2 Sep 2015 at 18:07, <nib...@free.fr
Store them as a Hadoop archive (HAR).
On Wed, 2 Sep 2015 at 18:07, wrote:
> Hello,
> I'am currently using Spark Streaming to collect small messages (events) ,
> size being <50 KB , volume is high (several millions per day) and I have to
> store those messages in HDFS.
> I
You might think about another storage layer than MongoDB
(HDFS+ORC+compression or HDFS+Parquet+compression) to improve performance.
On Thu, 3 Sep 2015 at 9:15, Akhil Das wrote:
> On SSD you will get around 30-40MB/s on a single machine (on 4 cores).
>
>
guha <guha.a...@gmail.com> wrote:
>
>> Thanks for your info. I am planning to implement a pig udf to do record
>> look ups. Kindly let me know if this is a good idea.
>>
>> Best
>> Ayan
>>
>> On Thu, Sep 3, 2015 at 2:55 PM, Jörn Franke <jo
o use Pig on it
> and what about performances ?
>
> ----- Original Message -----
> From: "Jörn Franke" <jornfra...@gmail.com>
> To: nib...@free.fr, user@spark.apache.org
> Sent: Thursday, 3 September 2015 15:54:42
> Subject: Re: Small File to HDFS
>
>
>
>
> Store them as h
(remove/replace) a file inside the HAR ?
> Basically the name of my small files will be the keys of my records , and
> sometimes I will need to replace the content of a file by a new content
> (remove/replace)
>
>
> Tks a lot
> Nicolas
>
> ----- Original Message -----
>
Well, if it needs to read from HDFS then it will adhere to the permissions
defined there and/or in Ranger. However, I am not aware that you can
protect DataFrames, tables or streams in general in Spark.
On Thu, 3 Sep 2015 at 21:47, Daniel Schulz wrote:
> Hi
You may check whether it makes sense to write a coprocessor doing the upsert
for you, if it does not exist already. Maybe Phoenix for HBase supports this
already.
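With Phoenix, an upsert via its JDBC driver would look roughly like this (untested sketch; the ZooKeeper quorum, table and columns are assumptions):

    import java.sql.DriverManager

    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    val stmt = conn.prepareStatement(
      "UPSERT INTO records (id, payload) VALUES (?, ?)")
    stmt.setLong(1, 42L)
    stmt.setString(2, "some value")
    stmt.executeUpdate()
    conn.commit() // Phoenix connections do not auto-commit by default
    stmt.close()
    conn.close()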
Another alternative, if the records do not have a unique id, is to put
them into a text index engine, such as Solr or Elasticsearch, which
It depends on what you need to do. Can you tell us more about your use cases?
On Tue, 1 Sep 2015 at 13:07, rakesh sharma wrote:
> Is it mature enough to use it extensively. I see that it is easier to do
> than writing map/reduce in java.
> We are being asked to do it
I think there is already an example for this shipped with Spark. However,
you do not really benefit from any Spark functionality in this scenario.
If you want to do something more advanced you should look at Elasticsearch
or Solr.
On Fri, 28 Aug 2015 at 16:15, Darksu nick_tou...@hotmail.com wrote:
Have you tried TABLESAMPLE? You will find the exact syntax in the
documentation, but it does exactly what you want.
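Roughly like this (untested sketch in Hive, shown via a HiveContext; the table name is an assumption):

    // Sample roughly 1% of the table's input instead of scanning all of it:
    val sample = hiveContext.sql("SELECT * FROM events TABLESAMPLE(1 PERCENT) s")

    // Or, if the table is bucketed, sample whole buckets:
    //   SELECT * FROM events TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) s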
On Wed, 26 Aug 2015 at 18:12, Thomas Dudziak tom...@gmail.com wrote:
Sorry, I meant without reading from all splits. This is a single partition
in the table.
On Wed, Aug 26,
I would use Sqoop. It has been designed exactly for these types of
scenarios. Spark Streaming does not make sense here.
On Sun, 5 Jul 2015 at 1:59, ayan guha guha.a...@gmail.com wrote:
Hi All
I have a requirement to connect to a DB every few minutes and bring data to
HBase. Can anyone
Hi,
First you need to make your SLAs clear. They do not sound well defined to me,
nor does your solution seem necessary for the scenario. I
also find it hard to believe that one customer has 100 million transactions
per month.
Time series data is easy to precalculate - you do not
Check also Falcon in combination with Oozie.
On Fri, 7 Aug 2015 at 17:51, Hien Luu h...@linkedin.com.invalid wrote:
Looks like Oozie can satisfy most of your requirements.
On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone vikramk...@gmail.com wrote:
Hi,
I'm looking for open source workflow
Yes, you should use ORC; it is much faster and more compact. Additionally, you
can apply compression (Snappy) to increase performance. Your data
processing pipeline seems not very optimized. You should use the
newest Hive version, enabling storage indexes and Bloom filters on
appropriate
in join statements (not WHERE),
otherwise you do a full table scan ignoring the partitions.
On Thu, 6 Aug 2015 at 15:07, Jörn Franke jornfra...@gmail.com wrote:
Yes you should use orc it is much faster and more compact. Additionally
you can apply compression (snappy) to increase performance. Your
Hi,
I think the combination of MongoDB and Spark is a little bit unlucky.
Why don't you simply use MongoDB?
If you want to process a lot of data you should use HDFS or Cassandra as
storage. MongoDB is not suitable for heterogeneous processing of large-scale
data.
Best regards,
I think your use case can already be implemented with HDFS encryption and/or
SealedObject, if you are looking for something like Altibase.
If you create a JIRA you may want to set the bar a little higher and
propose something like MIT's CryptDB: https://css.csail.mit.edu/cryptdb/
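For illustration, javax.crypto.SealedObject usage looks roughly like this (untested sketch): it encrypts a serializable value so only key holders can read it.

    import javax.crypto.{Cipher, KeyGenerator, SealedObject}

    val key = KeyGenerator.getInstance("AES").generateKey()

    // Seal: serialize and encrypt the object with the symmetric key.
    val enc = Cipher.getInstance("AES")
    enc.init(Cipher.ENCRYPT_MODE, key)
    val sealed = new SealedObject("sensitive record", enc)

    // Unseal: only possible with the same key.
    val dec = Cipher.getInstance("AES")
    dec.init(Cipher.DECRYPT_MODE, key)
    val plain = sealed.getObject(dec).asInstanceOf[String]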
On Fri, 31 Jul 2015 at 10:17,
What Hive version are you using? Do you run it on Tez? Are you using the
ORC format? Do you use compression? Snappy? Do you use Bloom filters? Do
you insert the data sorted on the right columns? Do you use partitioning?
Did you increase the replication factor for often-used tables or
You may check out Apache Phoenix on top of HBase for this. However, it does
not have ODBC drivers, only JDBC ones. Maybe Hive 1.2 with a new version of
Tez will also serve your purpose. You should run some proof of concept with
these technologies using real or generated data. About how much data
Can you put some transparent cache in front of the database? Or some JDBC
proxy?
On Tue, 28 Jul 2015 at 19:34, Jeetendra Gangele gangele...@gmail.com wrote:
can the source write to Kafka/Flume/Hbase in addition to Postgres? no
it can't write ,this is due to the fact that there are many
Use a Hadoop distribution that supports Windows and has Spark included.
Generally, if you want to use Windows, you should use the server version.
On Sat, 25 Jul 2015 at 20:11, Peter Leventis pleven...@telkomsa.net wrote:
I just wanted an easy step by step guide as to exactly what version
He should still see something. I think you need to subscribe to the
screen name first and not filter it out only in the filter method. I do not
have the APIs at hand on mobile, but there should be a method.
On Thu, 23 Jul 2015 at 22:30, Enno Shioji eshi...@gmail.com wrote:
You need to pay
Can you provide an example of an AND query? If you do just look-ups you
should try HBase/Phoenix; otherwise you can try ORC with storage indexes
and/or compression, but this depends on what your queries look like.
On Wed, 22 Jul 2015 at 14:48, Jeetendra Gangele gangele...@gmail.com wrote:
HI
choice?
I don't want to iterate the result set which HBase returns and give the
result, because this will kill the performance.
On 23 July 2015 at 01:02, Jörn Franke jornfra...@gmail.com wrote:
Can you provide an example of an and query ? If you do just look-up you
should try Hbase/ phoenix
Why do you even want to stop it? You can join it with an RDD loading the
newest hash tags from disk at a regular interval.
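Something like this (untested sketch; tweets stands for your existing DStream of twitter4j.Status, and the HDFS path is an assumption). transform runs on the driver once per batch, so the list is re-read every interval:

    val filtered = tweets.transform { rdd =>
      // Re-read the current tag list on the driver at the start of each batch.
      val tags = rdd.sparkContext.textFile("hdfs:///config/hashtags")
        .collect().toSet
      rdd.filter(status => tags.exists(tag => status.getText.contains(tag)))
    }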
On Sun, 19 Jul 2015 at 7:40, Zoran Jeremic zoran.jere...@gmail.com wrote:
Hi,
I have a twitter spark stream initialized in the following way:
val
Well, one of the strengths of Spark is standardized general distributed
processing, allowing many different types of processing, such as graph
processing, stream processing etc. The limitation is that it is less
performant than a system focusing on only one type of processing (e.g.
graph processing).
What is your business case for the move?
On Fri, 10 Jul 2015 at 12:49, Ravisankar Mani rrav...@gmail.com wrote:
Hi everyone,
I have planned to move from MS SQL Server to Spark. I am using around 50,000
to 1 lakh records.
The spark performance is slow when compared to mssql server.
What is
Honestly, you are addressing this wrongly - you do not seem to have a
business case for changing - so why do you want to switch?
On Sat, 11 Jul 2015 at 3:28, Mohammed Guller moham...@glassbeam.com wrote:
Hi Ravi,
First, Neither Spark nor Spark SQL is a database. Both are compute
engines,
On Sat, 11 Jul 2015 at 14:53, Roman Sokolov ole...@gmail.com wrote:
Hello. Had the same question. What if I need to store 4-6 Tb and do
queries? Can't find any clue in documentation.
On 11.07.2015 03:28, Mohammed Guller moham...@glassbeam.com wrote:
Hi Ravi,
First, Neither Spark nor
IronPython shares only the syntax with Python - at best. It is a scripting
language within the .NET framework; many applications have this for
scripting the application itself. This won't work for you. You can use
pipes (see the sketch below) or write your Spark jobs in Java/Scala/R and
submit them via your .NET
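The pipe route looks roughly like this (untested sketch; the paths are assumptions, and the executable must exist on every worker). RDD.pipe streams each partition through an external process over stdin/stdout, one record per line:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("pipe-example"))
    val scored = sc.textFile("hdfs:///data/input").pipe("/opt/models/score.exe")
    scored.saveAsTextFile("hdfs:///data/scored")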
Can you provide the result set you are using and specify how you integrated
the Drools engine?
Drools is basically based on a large shared memory. Hence, if you have
several tasks in Spark, they end up using different shared memory areas.
A full integration of Drools requires some sophisticated
Generally (not only Spark SQL specific), you should not cast in the WHERE
part of a SQL query. It is also not necessary in your case. Getting rid of
casts in the whole query will also be beneficial.
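For example, instead of casting the filtered column, compare against a range of raw values (untested sketch; the table and column are assumptions):

    // sqlContext: an existing SQLContext or HiveContext.
    // Instead of: SELECT * FROM events WHERE CAST(ts AS DATE) = '2015-06-22'
    val q = sqlContext.sql(
      """SELECT * FROM events
        |WHERE ts >= '2015-06-22 00:00:00'
        |  AND ts <  '2015-06-23 00:00:00'""".stripMargin)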
On Mon, 22 Jun 2015 at 17:29, James Aley james.a...@swiftkey.com wrote:
Hello,
A colleague
You may compare the c:\windows\system32\drivers\etc\hosts files to see whether
they are configured similarly.
On Wed, 10 Jun 2015 at 17:16, Eran Medan eran.me...@gmail.com wrote:
I'm on a road block trying to understand why Spark doesn't work for a
colleague of mine on his Windows 7 laptop.
I have pretty
I am not sure they work with HDFS paths. You may want to look at the
source code. Alternatively you can create a fat jar containing all jars
(let your build tool set META-INF correctly). This always works.
On Wed, 10 Jun 2015 at 6:22, Dong Lei dong...@microsoft.com wrote:
Thanks So much!
Hi, if you have just one physical machine then I would try out Docker
instead of a full VM (which would be a waste of memory and CPU).
Best regards
On 20 Apr 2015 00:11, hnahak harihar1...@gmail.com wrote:
Hi All,
I've big physical machine with 16 CPUs , 256 GB RAM, 20 TB Hard disk. I
just
need
Hello,
Just because you receive the log files hourly does not mean you have to use
Spark Streaming. Spark Streaming is often used if you receive new events
each minute/second, potentially at an irregular frequency. Of course your
analysis window can be larger.
I think your use case justifies
Hello,
Well, all problems you want to solve with technology need a good
justification for a certain technology. So the first thing to ask is which
technology fits my current and future problems. This is also what
the article says. Unfortunately, it only provides a vague answer
You can also preaggregate results for the queries by the user - depending
on what queries they use, this might be necessary for any underlying
technology.
On 26 Mar 2015 11:27, kundan kumar iitr.kun...@gmail.com wrote:
Hi,
I need to store terabytes of data which will be used for BI tools
, kundan kumar iitr.kun...@gmail.com
wrote:
I looking for some options and came across
http://www.jethrodata.com/
On Thu, Mar 26, 2015 at 5:47 PM, Jörn Franke jornfra...@gmail.com
wrote:
You can also preaggregate results for the queries by the user -
depending on what queries they use
You probably need to add the DLL directory to the PATH (not classpath!)
environment variable on all nodes.
On 26 Mar 2015 06:23, Xi Shen davidshe...@gmail.com wrote:
Not of course...all machines in HDInsight are Windows 64bit server. And I
have made sure all my DLLs are for 64bit machines.
Hello,
Depending on your needs, search technology such as SolrCloud or
Elasticsearch makes more sense. If you go for the Cassandra solution you
can use the Lucene text indexer...
I am not sure if Hive or Spark SQL are very suitable for text. However, if
you do not need text search then feel free
What database are you using?
On 28 Feb 2015 18:15, Michal Klos michal.klo...@gmail.com wrote:
Hi Spark community,
We have a use case where we need to pull huge amounts of data from a SQL
query against a database into Spark. We need to execute the query against
our huge database and not
You may also think about whether your use case really needs a very strict
order, because configuring Spark to support such a strict order means
rendering most of its benefits useless (failure handling, parallelism etc.).
Usually, in a distributed setting you can order events, but this also means
that
Hi,
What do your jobs do? Ideally post the source code, but some description would
already be helpful for us to support you.
Memory leaks can have several reasons - it may not be Spark at all.
Thank you.
On 26 Jan 2015 22:28, Gerard Maas gerard.m...@gmail.com wrote:
(looks like the list didn't like
I recommend using a build tool within Eclipse, such as Gradle or Maven.
On 24 Jan 2015 19:34, riginos samarasrigi...@gmail.com wrote:
How to compile a Spark project in Scala IDE for Eclipse? I got many scala
scripts and i no longer want to load them from scala-shell what can i do?
--
Did you try it with a smaller subset of the data first?
On 23 Jan 2015 05:54, Kane Kim kane.ist...@gmail.com wrote:
I'm trying to process 5TB of data, not doing anything fancy, just
map/filter and reduceByKey. Spent whole day today trying to get it
processed, but never succeeded. I've
Maybe you are using the wrong approach - try something like HyperLogLog or
bitmap structures, as you can find them, for instance, in Redis. They are
much smaller.
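On the Spark side, countApproxDistinct gives a HyperLogLog-style estimate without materializing the full key set (untested sketch; the RDD and the 1% relative error are assumptions):

    // events: RDD[String] of keys; returns an approximate distinct count.
    val approxUniques = events.countApproxDistinct(relativeSD = 0.01)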
On 22 Jan 2015 17:19, Balakrishnan Narendran balu.na...@gmail.com wrote:
Thank you Jerry,
Does the window operation create new
Can't you send a special event through Spark Streaming once the list is
updated? Then you have your normal events and a special reload event.
On 17 Jan 2015 15:06, Ji ZHANG zhangj...@gmail.com wrote:
Hi,
I want to join a DStream with some other dataset, e.g. join a click
stream with a spam
Basically, you have to think about how to split the data (for pictures this
can be, for instance, 8x8 matrices) and use Spark to distribute it to
different workers, which themselves call OpenCV with the data. Afterwards
you need to combine all results again. It really depends on your image /
video
Hello,
Based on experiences with other software in virtualized environments, I
cannot really recommend this. However, I am not sure how Spark reacts. You
may face unpredictable task failures depending on utilization; tasks
connecting to external systems (databases etc.) may fail unexpectedly and
Hello,
It really depends on your requirements: what kind of machine learning
algorithm, your budget, whether you are currently doing something really new
or integrating it with an existing application, etc. You can run MongoDB as
well as a cluster. I don't think this question can be answered generally,
but
Hi,
What is your cluster setup? How much memory do you have? How much space
does one row consisting only of the 3 columns consume? Do you run other
stuff in the background?
Best regards
On 04.12.2014 23:57, bonnahu bonn...@gmail.com wrote:
I am trying to load a large Hbase table into SPARK
Do you create the application in the context of the web service call? Then
the application may be killed after you return from the web service call.
However, we would need to see what you do during the web service call and
how you invoke the Spark application.
On 16 Oct 2014 08:50, Mehdi Singer