Re: Batch of updates

2014-10-28 Thread Flavio Pompermaier
, you can go with mapPartitions. Regards, Kamal -- Flavio Pompermaier, Development Department, OKKAM Srl - www.okkam.it http://www.okkam.it/ Phone: +(39) 0461 283 702 Fax: + (39) 0461 186 6433 Email: pomperma...@okkam.it
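
A minimal sketch of the per-partition batching that the mapPartitions suggestion points at, written in plain Python so it runs without a cluster. The names `flush_in_batches`, `flush` and `batch_size` are illustrative assumptions, not anything taken from the thread; the function only shows the buffering logic you might hand to a per-partition operation.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def flush_in_batches(
    records: Iterable[T],
    flush: Callable[[List[T]], None],
    batch_size: int = 100,
) -> Iterator[int]:
    """Buffer the records of one partition and flush every `batch_size` elements.

    `flush` is a hypothetical callback that sends one batch to the server.
    The generator yields the size of each batch it flushed.
    """
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            flush(batch)
            yield len(batch)
            batch = []  # start a fresh list; the callback may keep the old one
    if batch:
        # Flush the final, possibly smaller, batch of the partition.
        flush(batch)
        yield len(batch)
```

With PySpark this could presumably be wired in as `rdd.mapPartitions(lambda it: flush_in_batches(it, send_to_server))`, where `send_to_server` is a hypothetical function, followed by an action such as `count()` so that the lazy transformation actually runs.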

Batch of updates

2014-10-27 Thread Flavio Pompermaier
Hi to all, I'm trying to convert my old MapReduce job to a Spark one but I have some doubts. My application basically buffers a batch of updates and flushes the batch to a server every 100 elements. This is very easy in MapReduce but I don't know how to do it in Scala. For example, if

Re: Dedup

2014-10-08 Thread Flavio Pompermaier
Maybe you could implement something like this (I don't know if something similar already exists in Spark): http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf Best, Flavio On Oct 8, 2014 9:58 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Multiple values may be different, yet

RE: Does HiveContext support Parquet?

2014-08-16 Thread Flavio Pompermaier
Hi to all, sorry for not being fully on topic but I have 2 quick questions about Parquet tables registered in Hive/sparq: 1) where are the created tables stored? 2) If I have multiple hiveContexts (one per application) using the same parquet table, is there any problem if inserting concurrently

Re: Save an RDD to a SQL Database

2014-08-07 Thread Flavio Pompermaier
Isn't sqoop export meant for that? http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1 On Aug 7, 2014 7:59 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Vida, What kind of database are you trying to write to? For example, I found that for loading into

Streaming on different store types

2014-07-30 Thread Flavio Pompermaier
Hi everybody, I have a scenario where I would like to stream data to different persistence types (e.g. SQL DB, graph DB, HDFS, etc.) and perform some filtering and transformation as the data comes in. The problem is maintaining consistency between all datastores (maybe some operation could fail)

Shark vs Impala

2014-06-22 Thread Flavio Pompermaier
Hi folks, I was looking at the benchmark provided by Cloudera at http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/ . Is it true that Shark cannot execute some queries if you don't have enough memory? And is it true/reliable that Impala

Re: Spark streaming and rate limit

2014-06-19 Thread Flavio Pompermaier
. -Soumya On Wed, Jun 18, 2014 at 6:50 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Thanks for the quick reply, Soumya. Unfortunately I'm a newbie with Spark. What do you mean? Is there any reference on how to do that? On Thu, Jun 19, 2014 at 12:24 AM, Soumya Simanta soumya.sima

Re: Spark streaming and rate limit

2014-06-19 Thread Flavio Pompermaier
On 19 June 2014 07:50, Flavio Pompermaier pomperma...@okkam.it wrote: Yes, I need to call the external service for every event and the order does not matter. There's no time limit in which each event should

Re: Spark streaming and rate limit

2014-06-19 Thread Flavio Pompermaier
to see how they manage it using several worker threads. My suggestion would be to knock-up a basic custom receiver and give it a shot! MC On 19 June 2014 09:31, Flavio Pompermaier pomperma...@okkam.it wrote: Hi Michael, thanks for the tip, it's really an elegant solution. What I'm still

Spark streaming and rate limit

2014-06-18 Thread Flavio Pompermaier
Hi to all, in my use case I'd like to receive events and call an external service as they pass through. Is it possible to limit the number of concurrent calls to that service (to avoid DoS) using Spark Streaming? If so, limiting the rate implies a possible buffer growth... how can I control the
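
One way to picture the two concerns in this question (a cap on simultaneous calls, and a buffer that must not grow without bound) is a worker pool guarded by a semaphore. The sketch below is plain Python, not Spark Streaming API; `call_with_limit`, `call_service` and `max_in_flight` are hypothetical names chosen for illustration. The same idea could presumably live inside a custom receiver or a per-partition function.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def call_with_limit(
    events: Iterable[T],
    call_service: Callable[[T], R],
    max_in_flight: int = 4,
) -> List[R]:
    """Call `call_service` for every event with at most `max_in_flight` calls at once.

    The semaphore makes the producer loop block instead of queueing events
    without limit, which keeps the pending-work buffer bounded.
    """
    slots = threading.BoundedSemaphore(max_in_flight)

    def guarded(event: T) -> R:
        try:
            return call_service(event)
        finally:
            slots.release()  # free a slot whether the call succeeded or failed

    futures = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for event in events:
            slots.acquire()  # waits here while all slots are taken
            futures.append(pool.submit(guarded, event))
    # Results come back in input order; an exception raised by a call is re-raised here.
    return [f.result() for f in futures]
```

Blocking the producer trades throughput for a bounded buffer; whether that is acceptable depends on how the upstream source behaves when it is slowed down.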

Re: Spark streaming and rate limit

2014-06-18 Thread Flavio Pompermaier
into Spark. This component can control the input rate to Spark. On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi to all, in my use case I'd like to receive events and call an external service as they pass through. Is it possible to limit the number

Re: Using Spark to analyze complex JSON

2014-05-22 Thread Flavio Pompermaier
Is there a way to query fields by similarity (like Lucene or using a similarity metric) to be able to query something like WHERE language LIKE it~0.5 ? Best, Flavio On Thu, May 22, 2014 at 8:56 AM, Michael Cutler mich...@tumra.com wrote: Hi Nick, Here is an illustrated example which

Re: Schema view of HadoopRDD

2014-05-16 Thread Flavio Pompermaier
Is there any Spark plugin/add-on that facilitates querying JSON content? Best, Flavio On Thu, May 15, 2014 at 6:53 PM, Michael Armbrust mich...@databricks.com wrote: Here is a link with more info: http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html On Wed, May

Re: A new resource for getting examples of Spark RDD API calls

2014-05-13 Thread Flavio Pompermaier
Great work! Thanks! On May 13, 2014 3:16 AM, zhen z...@latrobe.edu.au wrote: Hi Everyone, I found it quite difficult to find good examples for Spark RDD API calls. So my student and I decided to go through the entire API and write examples for the vast majority of API calls (basically

Re: RDD collect help

2014-04-18 Thread Flavio Pompermaier
it is debatable and it's more my personal opinion. 2014-04-17 23:28 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it: Thanks again Eugen! I don't get the point. Why do you prefer to avoid Kryo serialization for closures? Is there any problem with that? On Apr 17, 2014 11:10 PM, Eugen Cepoi cepoi.eu...@gmail.com

RDD collect help

2014-04-14 Thread Flavio Pompermaier
Hi to all, in my application I read objects that are not serializable because I cannot modify the sources. So I tried a workaround: creating a dummy class that extends the unmodifiable one but implements Serializable. All attributes of the parent class are Lists of objects (some of them are

Re: RDD collect help

2014-04-14 Thread Flavio Pompermaier
Eugen 2014-04-14 18:21 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it: Hi to all, in my application I read objects that are not serializable because I cannot modify the sources. So I tried to do a workaround creating a dummy class that extends the unmodifiable one but implements

Re: RDD collect help

2014-04-14 Thread Flavio Pompermaier
serialization does not serialize/deserialize attributes from classes that don't implement Serializable (in your case the parent classes). 2014-04-14 23:17 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it: Thanks Eugen for the reply. Could you explain to me why I have the problem? Why doesn't my serialization work

Re: Spark on YARN performance

2014-04-10 Thread Flavio Pompermaier
resources share cluster. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Apr 9, 2014 at 12:10 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi to everybody, I'm new to Spark and I'd like to know if running

Re: Spark operators on Objects

2014-04-10 Thread Flavio Pompermaier
? Is there any suggestion about how to start? On Wed, Apr 9, 2014 at 11:37 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Any help about this...? On Apr 9, 2014 9:19 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi to everybody, In my current scenario I have complex objects stored

Spark on YARN performance

2014-04-09 Thread Flavio Pompermaier
Hi to everybody, I'm new to Spark and I'd like to know if running Spark on top of YARN or Mesos could affect (and how much) its performance. Is there any doc about this? Best, Flavio

Re: Spark and HBase

2014-04-08 Thread Flavio Pompermaier
at 9:57 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi to everybody, these days I have looked a bit at the recent evolution of the big data stacks and it seems that HBase is somehow fading away in favour of Spark+HDFS. Am I correct? Do you think that Spark and HBase should work together