Re: Few basic spark questions

2015-07-14 Thread Feynman Liang
You could implement the receiver as a Spark Streaming Receiver; the data
received would then be available to any streaming application which operates
on DStreams (e.g. Streaming KMeans).
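For illustration, a minimal sketch of such a receiver (the class, queue, and
helper names below are invented, not from this thread): it hands each record
to Spark with store(), and the resulting DStream can feed any streaming
computation.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Sketch only: a custom receiver that pushes records into Spark Streaming via store().
class RabbitMQReceiver(queueName: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Consume on a separate thread so onStart() returns quickly.
    new Thread("rabbitmq-consumer") {
      override def run(): Unit = {
        while (!isStopped()) {
          val message = fetchNextMessage() // hypothetical helper wrapping the RabbitMQ client
          store(message)                   // makes the record part of the DStream
        }
      }
    }.start()
  }

  def onStop(): Unit = { /* close the RabbitMQ connection here */ }

  private def fetchNextMessage(): String = ??? // placeholder for the real consume call
}

object ReaderApp {
  def main(args: Array[String]): Unit = {
    val conf  = new SparkConf().setAppName("reader")
    val ssc   = new StreamingContext(conf, Seconds(10))
    val lines = ssc.receiverStream(new RabbitMQReceiver("events")) // DStream[String]
    lines.print() // any DStream operation (windowing, streaming MLlib models, ...) goes here
    ssc.start()
    ssc.awaitTermination()
  }
}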

On Tue, Jul 14, 2015 at 8:31 AM, Oded Maimon  wrote:

> Hi,
> Thanks for all the help.
> I'm still missing something very basic.
>
> If I wont use sparkR, which doesn't support streaming (will use mlib
> instead as Debasish suggested), and I have my scala receiver working, how
> the receiver should save the data in memory? I do see the store method, so
> if i use it, how can i read the data from a different spark scala/java
> application? how do i find/query this data?
>
>
> Regards,
> Oded Maimon
> Scene53.
>
> On Tue, Jul 14, 2015 at 12:35 AM, Feynman Liang 
> wrote:
>
>> Sorry; I think I may have used poor wording. SparkR will let you use R to
>> analyze the data, but it has to be loaded into memory using SparkR (see 
>> SparkR
>> DataSources
>> ).
>> You will still have to write a Java receiver to store the data into some
>> tabular datastore (e.g. Hive) before loading them as SparkR DataFrames and
>> performing the analysis.
>>
>> R specific questions such as windowing in R should go to R-help@; you
>> won't be able to use window since that is a Spark Streaming method.
>>
>> On Mon, Jul 13, 2015 at 2:23 PM, Oded Maimon  wrote:
>>
>>> You are helping me understanding stuff here a lot.
>>>
>>> I believe I have 3 last questions..
>>>
>>> If is use java receiver to get the data, how should I save it in memory?
>>> Using store command or other command?
>>>
>>> Once stored, how R can read that data?
>>>
>>> Can I use window command in R? I guess not because it is a streaming
>>> command, right? Any other way to window the data?
>>>
>>> Sent from IPhone
>>>
>>>
>>>
>>>
>>> On Mon, Jul 13, 2015 at 2:07 PM -0700, "Feynman Liang" <
>>> fli...@databricks.com> wrote:
>>>
>>>  If you use SparkR then you can analyze the data that's currently in
 memory with R; otherwise you will have to write to disk (eg HDFS).

 On Mon, Jul 13, 2015 at 1:45 PM, Oded Maimon  wrote:

> Thanks again.
> What I'm missing is where can I store the data? Can I store it in
> spark memory and then use R to analyze it? Or should I use hdfs? Any other
> places that I can save the data?
>
> What would you suggest?
>
> Thanks...
>
> Sent from IPhone
>
>
>
>
> On Mon, Jul 13, 2015 at 1:41 PM -0700, "Feynman Liang" <
> fli...@databricks.com> wrote:
>
>  If you don't require true streaming processing and need to use R for
>> analysis, SparkR on a custom data source seems to fit your use case.
>>
>> On Mon, Jul 13, 2015 at 1:06 PM, Oded Maimon 
>> wrote:
>>
>>> Hi, thanks for replying!
>>> I want to do the entire process in stages. Get the data using Java
>>> or scala because they are the only Langs that supports custom receivers,
>>> keep the data , use R to analyze it, keep the results
>>> , output the data to different systems.
>>>
>>> I thought that  can be spark memory using rdd or
>>> dstreams.. But could it be that I need to keep it in hdfs to make the
>>> entire process in stages?
>>>
>>> Sent from IPhone
>>>
>>>
>>>
>>>
>>> On Mon, Jul 13, 2015 at 12:07 PM -0700, "Feynman Liang" <
>>> fli...@databricks.com> wrote:
>>>
>>>  Hi Oded,

 I'm not sure I completely understand your question, but it sounds
 like you could have the READER receiver produce a DStream which is
 windowed/processed in Spark Streaming and forEachRDD to do the OUTPUT.
 However, streaming in SparkR is not currently supported (SPARK-6803
 ) so I'm not too
 sure how ANALYZER would fit in.

 Feynman

 On Sun, Jul 12, 2015 at 11:23 PM, Oded Maimon 
 wrote:

> any help / idea will be appreciated :)
> thanks
>
>
> Regards,
> Oded Maimon
> Scene53.
>
> On Sun, Jul 12, 2015 at 4:49 PM, Oded Maimon 
> wrote:
>
>> Hi All,
>> we are evaluating spark for real-time analytic. what we are
>> trying to do is the following:
>>
>>- READER APP- use custom receiver to get data from rabbitmq
>>(written in scala)
>>- ANALYZER APP - use spark R application to read the data
>>(windowed), analyze it every minute and save the results inside 
>> spark
>>- OUTPUT APP - user spark application (scala/java/python) to
>>read the results from R every 

Re: Few basic spark questions

2015-07-14 Thread Oded Maimon
Hi,
Thanks for all the help.
I'm still missing something very basic.

If I won't use SparkR, which doesn't support streaming (I will use MLlib
instead, as Debasish suggested), and I have my Scala receiver working, how
should the receiver save the data in memory? I do see the store method; if
I use it, how can I read the data from a different Spark Scala/Java
application? How do I find/query this data?


Regards,
Oded Maimon
Scene53.
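One possible way to hand data from the receiving application to a separate
Spark application, along the lines of the write-to-disk (HDFS) suggestion
quoted further down in the thread. The DStream name, window size, paths, and
text format below are illustrative assumptions only; `sc` is an existing
SparkContext.

import org.apache.spark.streaming.Minutes

// In the streaming (receiver) application: persist each one-minute window to HDFS.
stream.window(Minutes(1)).foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty())
    rdd.saveAsTextFile(s"hdfs:///data/incoming/batch-${time.milliseconds}")
}

// In a separate Scala/Java Spark application: read whatever batches have landed so far.
val received = sc.textFile("hdfs:///data/incoming/batch-*")
println(s"records so far: ${received.count()}")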

On Tue, Jul 14, 2015 at 12:35 AM, Feynman Liang 
wrote:

> Sorry; I think I may have used poor wording. SparkR will let you use R to
> analyze the data, but it has to be loaded into memory using SparkR (see SparkR
> DataSources
> ).
> You will still have to write a Java receiver to store the data into some
> tabular datastore (e.g. Hive) before loading them as SparkR DataFrames and
> performing the analysis.
>
> R specific questions such as windowing in R should go to R-help@; you
> won't be able to use window since that is a Spark Streaming method.
>
> On Mon, Jul 13, 2015 at 2:23 PM, Oded Maimon  wrote:
>
>> You are helping me understanding stuff here a lot.
>>
>> I believe I have 3 last questions..
>>
>> If is use java receiver to get the data, how should I save it in memory?
>> Using store command or other command?
>>
>> Once stored, how R can read that data?
>>
>> Can I use window command in R? I guess not because it is a streaming
>> command, right? Any other way to window the data?
>>
>> Sent from IPhone
>>
>>
>>
>>
>> On Mon, Jul 13, 2015 at 2:07 PM -0700, "Feynman Liang" <
>> fli...@databricks.com> wrote:
>>
>>  If you use SparkR then you can analyze the data that's currently in
>>> memory with R; otherwise you will have to write to disk (eg HDFS).
>>>
>>> On Mon, Jul 13, 2015 at 1:45 PM, Oded Maimon  wrote:
>>>
 Thanks again.
 What I'm missing is where can I store the data? Can I store it in spark
 memory and then use R to analyze it? Or should I use hdfs? Any other places
 that I can save the data?

 What would you suggest?

 Thanks...

 Sent from IPhone




 On Mon, Jul 13, 2015 at 1:41 PM -0700, "Feynman Liang" <
 fli...@databricks.com> wrote:

  If you don't require true streaming processing and need to use R for
> analysis, SparkR on a custom data source seems to fit your use case.
>
> On Mon, Jul 13, 2015 at 1:06 PM, Oded Maimon  wrote:
>
>> Hi, thanks for replying!
>> I want to do the entire process in stages. Get the data using Java or
>> scala because they are the only Langs that supports custom receivers, 
>> keep
>> the data , use R to analyze it, keep the results ,
>> output the data to different systems.
>>
>> I thought that  can be spark memory using rdd or
>> dstreams.. But could it be that I need to keep it in hdfs to make the
>> entire process in stages?
>>
>> Sent from IPhone
>>
>>
>>
>>
>> On Mon, Jul 13, 2015 at 12:07 PM -0700, "Feynman Liang" <
>> fli...@databricks.com> wrote:
>>
>>  Hi Oded,
>>>
>>> I'm not sure I completely understand your question, but it sounds
>>> like you could have the READER receiver produce a DStream which is
>>> windowed/processed in Spark Streaming and forEachRDD to do the OUTPUT.
>>> However, streaming in SparkR is not currently supported (SPARK-6803
>>> ) so I'm not too
>>> sure how ANALYZER would fit in.
>>>
>>> Feynman
>>>
>>> On Sun, Jul 12, 2015 at 11:23 PM, Oded Maimon 
>>> wrote:
>>>
 any help / idea will be appreciated :)
 thanks


 Regards,
 Oded Maimon
 Scene53.

 On Sun, Jul 12, 2015 at 4:49 PM, Oded Maimon 
 wrote:

> Hi All,
> we are evaluating spark for real-time analytic. what we are trying
> to do is the following:
>
>- READER APP- use custom receiver to get data from rabbitmq
>(written in scala)
>- ANALYZER APP - use spark R application to read the data
>(windowed), analyze it every minute and save the results inside 
> spark
>- OUTPUT APP - user spark application (scala/java/python) to
>read the results from R every X minutes and send the data to few 
> external
>systems
>
> basically at the end i would like to have the READER COMPONENT as
> an app that always consumes the data and keeps it in spark,
> have as many ANALYZER COMPONENTS as my data scientists wants, and
> have one OUTPUT APP that will read the ANALYZER results and send it 
> to any
> relevant system.
>
> what is the right way to do it?
>
> Thanks,
> Oded.
>
>
>
>

Re: Few basic spark questions

2015-07-14 Thread Debasish Das
What do you need in SparkR that MLlib / ML don't have? ...most of the basic
analysis that you need on a stream can be done through MLlib components...
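For example, a rough sketch of MLlib's StreamingKMeans running directly on
the receiver's DStream (the CSV feature format, dimensionality, and
parameters here are assumptions, not details from this thread):

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors

// Assume `lines` is the DStream[String] from the custom receiver, one CSV record per message.
val features = lines.map(s => Vectors.dense(s.split(',').map(_.toDouble)))

val model = new StreamingKMeans()
  .setK(3)                    // number of clusters
  .setDecayFactor(1.0)        // weight all batches equally
  .setRandomCenters(4, 0.0)   // 4-dimensional features, zero initial weight

model.trainOn(features)            // cluster centers are updated as each batch arrives
model.predictOn(features).print()  // cluster assignment per record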
On Jul 13, 2015 2:35 PM, "Feynman Liang"  wrote:

> Sorry; I think I may have used poor wording. SparkR will let you use R to
> analyze the data, but it has to be loaded into memory using SparkR (see SparkR
> DataSources
> ).
> You will still have to write a Java receiver to store the data into some
> tabular datastore (e.g. Hive) before loading them as SparkR DataFrames and
> performing the analysis.
>
> R specific questions such as windowing in R should go to R-help@; you
> won't be able to use window since that is a Spark Streaming method.
>
> On Mon, Jul 13, 2015 at 2:23 PM, Oded Maimon  wrote:
>
>> You are helping me understanding stuff here a lot.
>>
>> I believe I have 3 last questions..
>>
>> If is use java receiver to get the data, how should I save it in memory?
>> Using store command or other command?
>>
>> Once stored, how R can read that data?
>>
>> Can I use window command in R? I guess not because it is a streaming
>> command, right? Any other way to window the data?
>>
>> Sent from IPhone
>>
>>
>>
>>
>> On Mon, Jul 13, 2015 at 2:07 PM -0700, "Feynman Liang" <
>> fli...@databricks.com> wrote:
>>
>>  If you use SparkR then you can analyze the data that's currently in
>>> memory with R; otherwise you will have to write to disk (eg HDFS).
>>>
>>> On Mon, Jul 13, 2015 at 1:45 PM, Oded Maimon  wrote:
>>>
 Thanks again.
 What I'm missing is where can I store the data? Can I store it in spark
 memory and then use R to analyze it? Or should I use hdfs? Any other places
 that I can save the data?

 What would you suggest?

 Thanks...

 Sent from IPhone




 On Mon, Jul 13, 2015 at 1:41 PM -0700, "Feynman Liang" <
 fli...@databricks.com> wrote:

  If you don't require true streaming processing and need to use R for
> analysis, SparkR on a custom data source seems to fit your use case.
>
> On Mon, Jul 13, 2015 at 1:06 PM, Oded Maimon  wrote:
>
>> Hi, thanks for replying!
>> I want to do the entire process in stages. Get the data using Java or
>> scala because they are the only Langs that supports custom receivers, 
>> keep
>> the data , use R to analyze it, keep the results ,
>> output the data to different systems.
>>
>> I thought that  can be spark memory using rdd or
>> dstreams.. But could it be that I need to keep it in hdfs to make the
>> entire process in stages?
>>
>> Sent from IPhone
>>
>>
>>
>>
>> On Mon, Jul 13, 2015 at 12:07 PM -0700, "Feynman Liang" <
>> fli...@databricks.com> wrote:
>>
>>  Hi Oded,
>>>
>>> I'm not sure I completely understand your question, but it sounds
>>> like you could have the READER receiver produce a DStream which is
>>> windowed/processed in Spark Streaming and forEachRDD to do the OUTPUT.
>>> However, streaming in SparkR is not currently supported (SPARK-6803
>>> ) so I'm not too
>>> sure how ANALYZER would fit in.
>>>
>>> Feynman
>>>
>>> On Sun, Jul 12, 2015 at 11:23 PM, Oded Maimon 
>>> wrote:
>>>
 any help / idea will be appreciated :)
 thanks


 Regards,
 Oded Maimon
 Scene53.

 On Sun, Jul 12, 2015 at 4:49 PM, Oded Maimon 
 wrote:

> Hi All,
> we are evaluating spark for real-time analytic. what we are trying
> to do is the following:
>
>- READER APP- use custom receiver to get data from rabbitmq
>(written in scala)
>- ANALYZER APP - use spark R application to read the data
>(windowed), analyze it every minute and save the results inside 
> spark
>- OUTPUT APP - user spark application (scala/java/python) to
>read the results from R every X minutes and send the data to few 
> external
>systems
>
> basically at the end i would like to have the READER COMPONENT as
> an app that always consumes the data and keeps it in spark,
> have as many ANALYZER COMPONENTS as my data scientists wants, and
> have one OUTPUT APP that will read the ANALYZER results and send it 
> to any
> relevant system.
>
> what is the right way to do it?
>
> Thanks,
> Oded.
>
>
>
>


Re: Few basic spark questions

2015-07-13 Thread Feynman Liang
Sorry; I think I may have used poor wording. SparkR will let you use R to
analyze the data, but it has to be loaded into memory using SparkR (see
SparkR DataSources). You will still have to write a Java receiver to store
the data into some tabular datastore (e.g. Hive) before loading them as
SparkR DataFrames and performing the analysis.

R specific questions such as windowing in R should go to R-help@; you won't
be able to use window since that is a Spark Streaming method.
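For reference, a minimal sketch of the window operation being referred to,
assuming a DStream called lines and a 10-second batch interval (neither is
specified in this thread):

import org.apache.spark.streaming.{Minutes, Seconds}

// Recompute over the last minute of received data, sliding every 30 seconds.
val lastMinute = lines.window(Minutes(1), Seconds(30))
lastMinute.count().print() // a DStream[Long]: one count per slide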

On Mon, Jul 13, 2015 at 2:23 PM, Oded Maimon  wrote:

> You are helping me understanding stuff here a lot.
>
> I believe I have 3 last questions..
>
> If is use java receiver to get the data, how should I save it in memory?
> Using store command or other command?
>
> Once stored, how R can read that data?
>
> Can I use window command in R? I guess not because it is a streaming
> command, right? Any other way to window the data?
>
> Sent from IPhone
>
>
>
>
> On Mon, Jul 13, 2015 at 2:07 PM -0700, "Feynman Liang" <
> fli...@databricks.com> wrote:
>
>  If you use SparkR then you can analyze the data that's currently in
>> memory with R; otherwise you will have to write to disk (eg HDFS).
>>
>> On Mon, Jul 13, 2015 at 1:45 PM, Oded Maimon  wrote:
>>
>>> Thanks again.
>>> What I'm missing is where can I store the data? Can I store it in spark
>>> memory and then use R to analyze it? Or should I use hdfs? Any other places
>>> that I can save the data?
>>>
>>> What would you suggest?
>>>
>>> Thanks...
>>>
>>> Sent from IPhone
>>>
>>>
>>>
>>>
>>> On Mon, Jul 13, 2015 at 1:41 PM -0700, "Feynman Liang" <
>>> fli...@databricks.com> wrote:
>>>
>>>  If you don't require true streaming processing and need to use R for
 analysis, SparkR on a custom data source seems to fit your use case.

 On Mon, Jul 13, 2015 at 1:06 PM, Oded Maimon  wrote:

> Hi, thanks for replying!
> I want to do the entire process in stages. Get the data using Java or
> scala because they are the only Langs that supports custom receivers, keep
> the data , use R to analyze it, keep the results ,
> output the data to different systems.
>
> I thought that  can be spark memory using rdd or dstreams..
> But could it be that I need to keep it in hdfs to make the entire process
> in stages?
>
> Sent from IPhone
>
>
>
>
> On Mon, Jul 13, 2015 at 12:07 PM -0700, "Feynman Liang" <
> fli...@databricks.com> wrote:
>
>  Hi Oded,
>>
>> I'm not sure I completely understand your question, but it sounds
>> like you could have the READER receiver produce a DStream which is
>> windowed/processed in Spark Streaming and forEachRDD to do the OUTPUT.
>> However, streaming in SparkR is not currently supported (SPARK-6803
>> ) so I'm not too
>> sure how ANALYZER would fit in.
>>
>> Feynman
>>
>> On Sun, Jul 12, 2015 at 11:23 PM, Oded Maimon 
>> wrote:
>>
>>> any help / idea will be appreciated :)
>>> thanks
>>>
>>>
>>> Regards,
>>> Oded Maimon
>>> Scene53.
>>>
>>> On Sun, Jul 12, 2015 at 4:49 PM, Oded Maimon 
>>> wrote:
>>>
 Hi All,
 we are evaluating spark for real-time analytic. what we are trying
 to do is the following:

- READER APP- use custom receiver to get data from rabbitmq
(written in scala)
- ANALYZER APP - use spark R application to read the data
(windowed), analyze it every minute and save the results inside 
 spark
- OUTPUT APP - user spark application (scala/java/python) to
read the results from R every X minutes and send the data to few 
 external
systems

 basically at the end i would like to have the READER COMPONENT as
 an app that always consumes the data and keeps it in spark,
 have as many ANALYZER COMPONENTS as my data scientists wants, and
 have one OUTPUT APP that will read the ANALYZER results and send it to 
 any
 relevant system.

 what is the right way to do it?

 Thanks,
 Oded.




>>>

Re: Few basic spark questions

2015-07-13 Thread Feynman Liang
Hi Oded,

I'm not sure I completely understand your question, but it sounds like you
could have the READER receiver produce a DStream which is
windowed/processed in Spark Streaming, with foreachRDD to do the OUTPUT.
However, streaming in SparkR is not currently supported (SPARK-6803), so
I'm not too sure how ANALYZER would fit in.

Feynman
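A minimal sketch of that window-then-foreachRDD shape for the OUTPUT step
(the DStream name and the sink are placeholders, not from the original
poster's setup):

import org.apache.spark.streaming.Minutes

// `analyzed` stands in for whatever DStream the ANALYZER stage would produce.
analyzed.window(Minutes(5)).foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // In a real OUTPUT app you would open one connection per partition here
    // and push the records to the external system.
    records.foreach(r => println(s"would send downstream: $r"))
  }
}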

On Sun, Jul 12, 2015 at 11:23 PM, Oded Maimon  wrote:

> any help / idea will be appreciated :)
> thanks
>
>
> Regards,
> Oded Maimon
> Scene53.
>
> On Sun, Jul 12, 2015 at 4:49 PM, Oded Maimon  wrote:
>
>> Hi All,
>> we are evaluating spark for real-time analytic. what we are trying to do
>> is the following:
>>
>>- READER APP- use custom receiver to get data from rabbitmq (written
>>in scala)
>>- ANALYZER APP - use spark R application to read the data (windowed),
>>analyze it every minute and save the results inside spark
>>- OUTPUT APP - user spark application (scala/java/python) to read the
>>results from R every X minutes and send the data to few external systems
>>
>> basically at the end i would like to have the READER COMPONENT as an app
>> that always consumes the data and keeps it in spark,
>> have as many ANALYZER COMPONENTS as my data scientists wants, and have
>> one OUTPUT APP that will read the ANALYZER results and send it to any
>> relevant system.
>>
>> what is the right way to do it?
>>
>> Thanks,
>> Oded.
>>
>>
>>
>>
>
>


Re: Few basic spark questions

2015-07-12 Thread Oded Maimon
any help / idea will be appreciated :)
thanks


Regards,
Oded Maimon
Scene53.

On Sun, Jul 12, 2015 at 4:49 PM, Oded Maimon  wrote:

> Hi All,
> we are evaluating spark for real-time analytic. what we are trying to do
> is the following:
>
>- READER APP- use custom receiver to get data from rabbitmq (written
>in scala)
>- ANALYZER APP - use spark R application to read the data (windowed),
>analyze it every minute and save the results inside spark
>- OUTPUT APP - user spark application (scala/java/python) to read the
>results from R every X minutes and send the data to few external systems
>
> basically at the end i would like to have the READER COMPONENT as an app
> that always consumes the data and keeps it in spark,
> have as many ANALYZER COMPONENTS as my data scientists wants, and have one
> OUTPUT APP that will read the ANALYZER results and send it to any relevant
> system.
>
> what is the right way to do it?
>
> Thanks,
> Oded.
>
>
>
>



Few basic spark questions

2015-07-12 Thread Oded Maimon
Hi All,
We are evaluating Spark for real-time analytics. What we are trying to do
is the following:

   - READER APP - use a custom receiver to get data from RabbitMQ (written
   in Scala)
   - ANALYZER APP - use a SparkR application to read the data (windowed),
   analyze it every minute and save the results inside Spark
   - OUTPUT APP - use a Spark application (Scala/Java/Python) to read the
   results from R every X minutes and send the data to a few external systems

Basically, at the end I would like to have the READER COMPONENT as an app
that always consumes the data and keeps it in Spark,
have as many ANALYZER COMPONENTS as my data scientists want, and have one
OUTPUT APP that will read the ANALYZER results and send them to any
relevant system.

What is the right way to do it?

Thanks,
Oded.



Re: Spark Questions

2014-07-14 Thread Gonzalo Zarza
Thanks for your answers Shuo Xiang and Aaron Davidson!

Regards,


--
*Gonzalo Zarza* | PhD in High-Performance Computing | Big-Data Specialist |
*GLOBANT* | AR: +54 11 4109 1700 ext. 15494 | US: +1 877 215 5230 ext. 15494


On Sat, Jul 12, 2014 at 9:02 PM, Aaron Davidson  wrote:

> I am not entirely certain I understand your questions, but let me assume
> you are mostly interested in SparkSQL and are thinking about your problem
> in terms of SQL-like tables.
>
> 1. Shuo Xiang mentioned Spark partitioning strategies, but in case you are
> talking about data partitioning or sharding as exist in Hive, SparkSQL does
> not currently support this, though it is on the roadmap. We can read from
> partitioned Hive tables, however.
>
> 2. If by entries/record you mean something like columns/row, SparkSQL does
> allow you to project out the columns you want, or select all columns. The
> efficiency of such a projection is determined by the how the data is
> stored, however: If your data is stored in an inherently row-based format,
> this projection will be no faster than doing an initial map() over the data
> to only select the desired columns. If it's stored in something like
> Parquet, or cached in memory, however, we would avoid ever looking at the
> unused columns.
>
> 3. Spark has a very generalized data source API, so it is capable of
> interacting with whatever data source. However, I don't think we currently
> have any SparkSQL connectors to RDBMSes that would support column pruning
> or other push-downs. This is all very much viable, however.
>
>
> On Fri, Jul 11, 2014 at 1:35 PM, Gonzalo Zarza 
> wrote:
>
>> Hi all,
>>
>> We've been evaluating Spark for a long-term project. Although we've been
>> reading several topics in forum, any hints on the following topics we'll be
>> extremely welcomed:
>>
>> 1. Which are the data partition strategies available in Spark? How
>> configurable are these strategies?
>>
>> 2. How would be the best way to use Spark if queries can touch only 3-5
>> entries/records? Which strategy is the best if they want to perform a full
>> scan of the entries?
>>
>> 3. Is Spark capable of interacting with RDBMS?
>>
>> Thanks a lot!
>>
>> Best regards,
>>
>> --
>> *Gonzalo Zarza* | PhD in High-Performance Computing | Big-Data
>> Specialist |
>> *GLOBANT* | AR: +54 11 4109 1700 ext. 15494 | US: +1 877 215 5230 ext.
>> 15494
>>
>
>


Re: Spark Questions

2014-07-12 Thread Aaron Davidson
I am not entirely certain I understand your questions, but let me assume
you are mostly interested in SparkSQL and are thinking about your problem
in terms of SQL-like tables.

1. Shuo Xiang mentioned Spark partitioning strategies, but in case you are
talking about data partitioning or sharding as they exist in Hive, SparkSQL does
not currently support this, though it is on the roadmap. We can read from
partitioned Hive tables, however.

2. If by entries/record you mean something like columns/row, SparkSQL does
allow you to project out the columns you want, or select all columns. The
efficiency of such a projection is determined by how the data is
stored, however: If your data is stored in an inherently row-based format,
this projection will be no faster than doing an initial map() over the data
to only select the desired columns. If it's stored in something like
Parquet, or cached in memory, however, we would avoid ever looking at the
unused columns.

3. Spark has a very generalized data source API, so it is capable of
interacting with essentially any data source. However, I don't think we currently
have any SparkSQL connectors to RDBMSes that would support column pruning
or other push-downs. This is all very much viable, however.
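As a small sketch of that column-pruning point, using the Spark SQL API of
that era (exact method names shifted slightly across early 1.x releases, and
the path and column names here are made up):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc: an existing SparkContext

// Parquet is columnar, so only the projected columns are read from disk.
val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
events.registerTempTable("events")

val projected = sqlContext.sql("SELECT user_id, amount FROM events")
projected.collect().foreach(println)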


On Fri, Jul 11, 2014 at 1:35 PM, Gonzalo Zarza 
wrote:

> Hi all,
>
> We've been evaluating Spark for a long-term project. Although we've been
> reading several topics in forum, any hints on the following topics we'll be
> extremely welcomed:
>
> 1. Which are the data partition strategies available in Spark? How
> configurable are these strategies?
>
> 2. How would be the best way to use Spark if queries can touch only 3-5
> entries/records? Which strategy is the best if they want to perform a full
> scan of the entries?
>
> 3. Is Spark capable of interacting with RDBMS?
>
> Thanks a lot!
>
> Best regards,
>
> --
> *Gonzalo Zarza* | PhD in High-Performance Computing | Big-Data Specialist
> |
> *GLOBANT* | AR: +54 11 4109 1700 ext. 15494 | US: +1 877 215 5230 ext.
> 15494
>


Re: Spark Questions

2014-07-12 Thread Shuo Xiang
For your first question, the partitioning strategy can be tuned by applying
a different partitioner. You can use existing ones such as HashPartitioner
or write your own. See this link (
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf)
for some instructions.
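A minimal sketch of both options mentioned above, reusing HashPartitioner
and writing a custom Partitioner (the keys and routing rule are made up for
illustration, and sc is assumed to be an existing SparkContext):

import org.apache.spark.{HashPartitioner, Partitioner}

val pairs = sc.parallelize(Seq(("alpha", 1), ("beta", 2), ("gamma", 3)))

// Built-in: hash keys into 8 partitions.
val hashed = pairs.partitionBy(new HashPartitioner(8))

// Custom: route keys starting with "a" to partition 0, everything else to partition 1.
class PrefixPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.toString.startsWith("a")) 0 else 1
}
val custom = pairs.partitionBy(new PrefixPartitioner)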



On Fri, Jul 11, 2014 at 1:35 PM, Gonzalo Zarza 
wrote:

> Hi all,
>
> We've been evaluating Spark for a long-term project. Although we've been
> reading several topics in forum, any hints on the following topics we'll be
> extremely welcomed:
>
> 1. Which are the data partition strategies available in Spark? How
> configurable are these strategies?
>
> 2. How would be the best way to use Spark if queries can touch only 3-5
> entries/records? Which strategy is the best if they want to perform a full
> scan of the entries?
>
> 3. Is Spark capable of interacting with RDBMS?
>
> Thanks a lot!
>
> Best regards,
>
> --
> *Gonzalo Zarza* | PhD in High-Performance Computing | Big-Data Specialist
> |
> *GLOBANT* | AR: +54 11 4109 1700 ext. 15494 | US: +1 877 215 5230 ext.
> 15494
>


Spark Questions

2014-07-11 Thread Gonzalo Zarza
Hi all,

We've been evaluating Spark for a long-term project. Although we've been
reading several topics in the forum, any hints on the following topics will
be extremely welcome:

1. Which data partitioning strategies are available in Spark? How
configurable are these strategies?

2. What would be the best way to use Spark if queries touch only 3-5
entries/records? Which strategy is best if they need to perform a full
scan of the entries?

3. Is Spark capable of interacting with RDBMS?

Thanks a lot!

Best regards,

--
*Gonzalo Zarza* | PhD in High-Performance Computing | Big-Data Specialist |
*GLOBANT* | AR: +54 11 4109 1700 ext. 15494 | US: +1 877 215 5230 ext. 15494


RE: Basic Scala and Spark questions

2014-06-24 Thread Sameer Tilak
Hi there, here is how I specify it during compilation:

scalac -classpath /apps/software/abc.jar:/apps/software/spark-1.0.0-bin-hadoop1/lib/datanucleus-api-jdo-3.2.1.jar:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:spark-assembly-1.0.0-hadoop1.0.4.jar/datanucleus-core-3.2.2.jar Score.scala

Then I generate a jar file out of it, say myapp. Finally, to run this I do the following:

./spark-shell --jars /apps/software/abc.jar,/apps/software/myapp/myapp.jar

Hope this helps.
From: vmuttin...@ebay.com
To: user@spark.apache.org; u...@spark.incubator.apache.org
Subject: RE: Basic Scala and Spark questions
Date: Tue, 24 Jun 2014 20:06:04 +









Hello Tilak,
1. I get a Not found: type RDD error. Can someone please tell me which jars
do I need to add as external jars and what should I add under import
statements so that this error will go away.
Do you not see any issues with the import statements?

Add the spark-assembly-1.0.0-hadoop2.2.0.jar file as a dependency.
You can download Spark from here (http://spark.apache.org/downloads.html). 
You’ll find the above mentioned
 jar in the lib folder. 
Import Statement: import org.apache.spark.rdd.RDD



From: Sameer Tilak [mailto:ssti...@live.com]
Sent: Monday, June 23, 2014 10:38 AM
To: u...@spark.incubator.apache.org
Subject: Basic Scala and Spark questions
Hi All,
I am new to Scala and Spark. I have a basic question. I have the following
import statements in my Scala program. I want to pass my function (printScore)
 to Spark. It will compare a string 
 
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
/* import thirdparty jars */
  
I have the following method in my Scala class:

class DistanceClass
{
  val ta = new textAnalytics();

  def printScore(sourceStr: String, rdd: RDD[String])
  {
    // Third party jars have StringWrapper
    val str1 = new StringWrapper (sourceStr)
    val ta_ = this.ta;

    rdd.map(str1, x => ta_.score(str1, StringWrapper(x))
  }

I am using Eclipse for development. I have the following questions:
1. I get a Not found: type RDD error. Can someone please tell me which jars
do I need to add as external jars and what should I add under import
statements so that this error will go away.
2. Also, including StringWrapper(x) inside map, will that be OK? rdd.map(str1, 
x => ta_.score(str1, StringWrapper(x))

 


  

RE: Basic Scala and Spark questions

2014-06-24 Thread Muttineni, Vinay
Hello Tilak,
1. I get a Not found: type RDD error. Can someone please tell me which jars
do I need to add as external jars and what should I add under import
statements so that this error will go away.
Do you not see any issues with the import statements?
Add the spark-assembly-1.0.0-hadoop2.2.0.jar file as a dependency.
You can download Spark from here (http://spark.apache.org/downloads.html). 
You'll find the above mentioned jar in the lib folder.
Import Statement: import org.apache.spark.rdd.RDD
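Putting the two fixes together, a sketch of how the corrected class could
look (textAnalytics and StringWrapper are the original poster's third-party
classes, and the Double return type is an assumption about what score
returns):

import org.apache.spark.rdd.RDD

class DistanceClass {
  val ta = new textAnalytics()

  def printScore(sourceStr: String, rdd: RDD[String]): RDD[Double] = {
    val str1 = new StringWrapper(sourceStr)
    val ta_ = this.ta // local reference so the whole enclosing class isn't pulled into the closure
    rdd.map(x => ta_.score(str1, new StringWrapper(x)))
  }
}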
From: Sameer Tilak [mailto:ssti...@live.com]
Sent: Monday, June 23, 2014 10:38 AM
To: u...@spark.incubator.apache.org
Subject: Basic Scala and Spark questions

Hi All,
I am new to Scala and Spark. I have a basic question. I have the following
import statements in my Scala program. I want to pass my function (printScore) 
to Spark. It will compare a string

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
/* import thirdparty jars */

I have the following method in my Scala class:

class DistanceClass
{
val ta = new textAnalytics();

def printScore(sourceStr: String, rdd: RDD[String])
{

// Third party jars have StringWrapper
val str1 = new StringWrapper (sourceStr)
val ta_ = this.ta;

rdd.map(str1, x => ta_.score(str1, StringWrapper(x))

}

I am using Eclipse for development. I have the following questions:
1. I get a Not found: type RDD error. Can someone please tell me which jars
do I need to add as external jars and what should I add under import
statements so that this error will go away.
2. Also, including StringWrapper(x) inside map, will that be OK? rdd.map(str1, 
x => ta_.score(str1, StringWrapper(x))



RE: Basic Scala and Spark questions

2014-06-23 Thread Sameer Tilak
Hi All, I was able to solve both these issues. Thanks!
Just FYI:

For 1:

import org.apache.spark.rdd;
import org.apache.spark.rdd.RDD;

For 2:

rdd.map(x => jc_.score(str1, new StringWrapper(x)))

From: ssti...@live.com
To: u...@spark.incubator.apache.org
Subject: Basic Scala and Spark questions
Date: Mon, 23 Jun 2014 10:38:04 -0700
Hi All, I am new to Scala and Spark. I have a basic question. I have the
following import statements in my Scala program. I want to pass my function
(printScore) to Spark. It will compare a string.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
/* import thirdparty jars */

I have the following method in my Scala class:

class DistanceClass
{
  val ta = new textAnalytics();

  def printScore(sourceStr: String, rdd: RDD[String])
  {
    // Third party jars have StringWrapper
    val str1 = new StringWrapper (sourceStr)
    val ta_ = this.ta;

    rdd.map(str1, x => ta_.score(str1, StringWrapper(x))
  }

I am using Eclipse for development. I have the following questions:

1. I get a Not found: type RDD error. Can someone please tell me which jars
do I need to add as external jars and what should I add under import
statements so that this error will go away?

2. Also, including StringWrapper(x) inside map, will that be OK?
rdd.map(str1, x => ta_.score(str1, StringWrapper(x))

  

Basic Scala and Spark questions

2014-06-23 Thread Sameer Tilak

Hi All, I am new to Scala and Spark. I have a basic question. I have the
following import statements in my Scala program. I want to pass my function
(printScore) to Spark. It will compare a string.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
/* import thirdparty jars */

I have the following method in my Scala class:

class DistanceClass
{
  val ta = new textAnalytics();

  def printScore(sourceStr: String, rdd: RDD[String])
  {
    // Third party jars have StringWrapper
    val str1 = new StringWrapper (sourceStr)
    val ta_ = this.ta;

    rdd.map(str1, x => ta_.score(str1, StringWrapper(x))
  }

I am using Eclipse for development. I have the following questions:

1. I get a Not found: type RDD error. Can someone please tell me which jars
do I need to add as external jars and what should I add under import
statements so that this error will go away?

2. Also, including StringWrapper(x) inside map, will that be OK?
rdd.map(str1, x => ta_.score(str1, StringWrapper(x))