Re: Do we need to kill a spark job every time we change and deploy it?

2018-11-28 Thread Irving Duran
Are you referring to having Spark pick up a new jar build?  If so, you can
probably script that in bash.

Thank You,

Irving Duran


On Wed, Nov 28, 2018 at 12:44 PM Mina Aslani  wrote:

> Hi,
>
> I have a question for you.
> Do we need to kill a spark job every time we change and deploy it to
> cluster? Or, is there a way for Spark to automatically pick up the recent
> jar version?
>
> Best regards,
> Mina
>


Re: spark-shell doesn't start

2018-06-19 Thread Irving Duran
You are trying to run "spark-shell" as a command, but it is not on your
PATH.  You might want to run "./spark-shell", or try "sudo ln -s
/path/to/spark-shell /usr/bin/spark-shell" and then run "spark-shell".

Thank You,

Irving Duran


On Sun, Jun 17, 2018 at 6:53 AM Raymond Xie  wrote:

> Hello, I am doing the practice in Ubuntu now, here is the error I am
> encountering:
>
>
> rxie@ubuntu:~/Downloads/spark/bin$ spark-shell
> Error: Could not find or load main class org.apache.spark.launcher.Main
>
>
> What am I missing?
>
> Thank you very much.
>
> Java is installed.
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>


Re: [Spark] Supporting python 3.5?

2018-06-19 Thread Irving Duran
Cool, thanks for the validation!

Thank You,

Irving Duran


On Thu, May 24, 2018 at 8:20 PM Jeff Zhang  wrote:

>
> It supports python 3.5, and IIRC, spark also support python 3.6
>
> Irving Duran 于2018年5月10日周四 下午9:08写道:
>
>> Does spark now support python 3.5 or it is just 3.4.x?
>>
>> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>>
>> Thank You,
>>
>> Irving Duran
>>
>


Re: [announce] BeakerX supports Scala+Spark in Jupyter

2018-06-07 Thread Irving Duran
So would you recommend not to have Toree and BeakerX installed to avoid
conflicts?

Thank you,

Irving Duran

On 06/07/2018 07:55 PM, s...@draves.org wrote:
> The %%spark magic comes with BeakerX's Scala kernel, not related to Toree.
>
> On Thu, Jun 7, 2018, 8:51 PM Stephen Boesch <java...@gmail.com> wrote:
>
> Assuming that the spark 2.X kernel (e.g. toree) were chosen for a
> given jupyter notebook and there is a  Cell 3 that contains some
> Spark DataFrame operations .. Then :
>
>   * what is the relationship between the %%spark magic and the
> toree kernel?
>   * how does the %%spark magic get applied to that other Cell 3 ?
>
> thanks!
>
> 2018-06-07 16:33 GMT-07:00 s...@draves.org <s...@draves.org>:
>
> We are pleased to announce release 0.19.0 of BeakerX
> <http://BeakerX.com>, a collection of extensions and kernels
> for Jupyter and Jupyter Lab.
>
> BeakerX now features Scala+Spark integration including GUI
> configuration, status, progress, interrupt, and interactive
> tables.
>
> We are very interested in your feedback about what remains to
> be done.  You may reach by github and gitter, as documented in
> the readme: https://github.com/twosigma/beakerx
>
> Thanks, -Scott
>
>
> -- 
> BeakerX.com <http://BeakerX.com>
> ScottDraves.com <http://www.ScottDraves.com>
> @Scott_Draves <http://twitter.com/scott_draves>
>
>





Re: If there is timestamp type data in DF, Spark 2.3 toPandas is much slower than spark 2.2.

2018-06-07 Thread Irving Duran
I haven't noticed or seen this behavior.  Have you confirmed it by
testing the same dataset between versions?
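
If you want to isolate it, here is a minimal comparison sketch I'd try (assuming a
local SparkSession with pandas available; the Arrow setting is Spark 2.3+ only and
needs pyarrow installed):

import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Synthetic frame: a numeric column plus a timestamp column.
df = spark.range(0, 1000000).withColumn("ts", F.current_timestamp())

def time_topandas(frame, label):
    start = time.time()
    frame.toPandas()
    print("{}: {:.1f}s".format(label, time.time() - start))

time_topandas(df.drop("ts"), "without timestamp")
time_topandas(df, "with timestamp")

# Spark 2.3+ only: Arrow-based conversion may sidestep the per-value
# timestamp handling (assumes pyarrow is installed).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
time_topandas(df, "with timestamp and Arrow")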

Thank you,

Irving Duran

On 06/06/2018 11:22 PM, 李斌松 wrote:
> If there is timestamp type data in DF, Spark 2.3 toPandas is much
> slower than spark 2.2.





Re: Apache Spark Installation error

2018-05-31 Thread Irving Duran
You probably need "spark-shell" to be recognized as a command in your
environment.  Maybe try "sudo ln -s /path/to/spark-shell
/usr/bin/spark-shell".  Have you tried "./spark-shell" in the current path
to see if it works?

Thank You,

Irving Duran


On Thu, May 31, 2018 at 9:00 AM Remil Mohanan  wrote:

> Hi there,
>
>I am not able to execute the spark-shell command. Can you please help.
>
> Thanks
>
> Remil
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: GroupBy in Spark / Scala without Agg functions

2018-05-29 Thread Irving Duran
Unless you want to get a count, yes.
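
For reference, a minimal sketch of the no-aggregate version (table and column names
are taken from the pseudocode quoted below, so treat them as placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumes `student` and `general_register` are registered as temp views
# (e.g. via createOrReplaceTempView); names follow the quoted pseudocode.
joined = (
    spark.table("student").alias("m")
    .join(spark.table("general_register").alias("g"),
          F.col("m.student_id") == F.col("g.student_id"))
    .select("m.student_id", "m.student_name", "m.student_std",
            "m.student_group", "m.student_dob")
)

# GROUP BY over all selected columns with no aggregate is the same as DISTINCT.
deduped = joined.distinct()   # or joined.dropDuplicates()

Both distinct() and dropDuplicates() give you a DataFrame back, which also covers
the return-value question below.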

Thank You,

Irving Duran


On Tue, May 29, 2018 at 1:44 PM Chetan Khatri 
wrote:

> Georg, I just want to double check: someone wrote an MSSQL Server script
> where it groups by all columns. What is the best alternative way to do a
> distinct over all columns?
>
>
>
> On Wed, May 30, 2018 at 12:08 AM, Georg Heiler 
> wrote:
>
>> Why do you group if you do not want to aggregate?
>> Isn't this the same as select distinct?
>>
>> Chetan Khatri  schrieb am Di., 29. Mai 2018
>> um 20:21 Uhr:
>>
>>> All,
>>>
>>> I have scenario like this in MSSQL Server SQL where i need to do groupBy
>>> without Agg function:
>>>
>>> Pseudocode:
>>>
>>>
>>> select m.student_id, m.student_name, m.student_std, m.student_group, m.student_dob
>>> from student as m
>>> inner join general_register g on m.student_id = g.student_id
>>> group by m.student_id, m.student_name, m.student_std, m.student_group, m.student_dob
>>>
>>> I tried doing this in Spark but I am not able to get a DataFrame as the
>>> return value. How could this kind of thing be done in Spark?
>>>
>>> Thanks
>>>
>>
>


[Spark] Supporting python 3.5?

2018-05-10 Thread Irving Duran
Does spark now support python 3.5 or it is just 3.4.x?

https://spark.apache.org/docs/latest/rdd-programming-guide.html

Thank You,

Irving Duran


Re: [pyspark] Read multiple files parallely into a single dataframe

2018-05-04 Thread Irving Duran
I could be wrong, but I think you can use a wildcard.

df = spark.read.format('csv').load('/path/to/file*.csv.gz')
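
If the sequential part you are seeing is the schema inference rather than the read
itself, one sketch worth trying is to pass the whole list of paths together with an
explicit schema so Spark skips the inference pass over each file (the schema below
is a placeholder; use your real columns):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Placeholder schema -- replace with the real column names and types.
schema = StructType([
    StructField("col_a", StringType()),
    StructField("col_b", DoubleType()),
])

paths = ["/path/to/file1.csv.gz", "/path/to/file2.csv.gz", "/path/to/file3.csv.gz"]

# One DataFrame over all paths; the read is distributed across tasks
# (gzipped files are not splittable, so expect one task per file).
df = spark.read.format("csv").schema(schema).load(paths)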

Thank You,

Irving Duran


On Fri, May 4, 2018 at 4:38 AM Shuporno Choudhury <
shuporno.choudh...@gmail.com> wrote:

> Hi,
>
> I want to read multiple files parallely into 1 dataframe. But the files
> have random names and cannot confirm to any pattern (so I can't use
> wildcard). Also, the files can be in different directories.
> If I provide the file names in a list to the dataframe reader, it reads
> then sequentially.
> Eg:
> df=spark.read.format('csv').load(['/path/to/file1.csv.gz','/path/to/file2.csv.gz','/path/to/file3.csv.gz'])
> This reads the files sequentially. What can I do to read the files
> parallely?
> I noticed that spark reads files parallely if provided directly the
> directory location. How can that be extended to multiple random files?
> Suppose if my system has 4 cores, how can I make spark read 4 files at a
> time?
>
> Please suggest.
>


Re: ML Linear and Logistic Regression - Poor Performance

2018-05-02 Thread Irving Duran
You may want to think about reducing the number of iterations.  Right now you
have it set to 500.

Thank You,

Irving Duran


On Fri, Apr 27, 2018 at 7:15 PM Thodoris Zois <z...@ics.forth.gr> wrote:

> I am in CentOS 7 and I use Spark 2.3.0. Below I have posted my code.
> Logistic regression took 85 minutes and linear regression 127 seconds…
>
> My dataset as I said is 128 MB and contains: 1000 features and ~100
> classes.
>
>
> #SparkSession
> ss = SparkSession.builder.getOrCreate()
>
>
> start = time.time()
>
> #Read data
> trainData = ss.read.format("csv").option("inferSchema","true").load(file)
>
> #Calculate Features
> assembler = VectorAssembler(inputCols=trainData.columns[1:], outputCol=
> "features")
> trainData = assembler.transform(trainData)
>
> #Drop columns
> dropColumns = trainData.columns
> dropColumns = [e for e in dropColumns if e not in ('_c0', 'features')]
> trainData = trainData.drop(*dropColumns)
>
> #Rename column from _c0 to label
> trainData = trainData.withColumnRenamed("_c0", "label")
>
> #Logistic regression
> lr = LogisticRegression(maxIter=500, regParam=0.3, elasticNetParam=0.8)
> lrModel = lr.fit(trainData)
>
> #Output Coefficients
> print("Coefficients: " + str(lrModel.coefficientMatrix))
>
>
>
> - Thodoris
>
>
> On 27 Apr 2018, at 22:50, Irving Duran <irving.du...@gmail.com> wrote:
>
> Are you reformatting the data correctly for logistic (meaning 0 & 1's)
> before modeling?  What OS and Spark version are you using?
>
> Thank You,
>
> Irving Duran
>
>
> On Fri, Apr 27, 2018 at 2:34 PM Thodoris Zois <z...@ics.forth.gr> wrote:
>
>> Hello,
>>
>> I am running an experiment to test logistic and linear regression on
>> spark using MLlib.
>>
>> My dataset is only 128MB and something weird happens. Linear regression
>> takes about 127 seconds either with 1 or 500 iterations. On the other hand,
>> logistic regression most of the times does not manage to finish either with
>> 1 iteration. I usually get memory heap error.
>>
>> In both cases I use the default cores and memory for driver and I spawn 1
>> executor with 1 core and 2GBs of memory.
>>
>> Except that, I get a warning about NativeBLAS. I searched in the Internet
>> and I found that I have to install libgfortran. Even if I did it the
>> warning remains.
>>
>> Any ideas for the above?
>>
>> Thank you,
>> - Thodoris
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: [Spark 2.x Core] .collect() size limit

2018-04-30 Thread Irving Duran
I don't think there is a magic number, so I would say that it will depend
on how big your dataset is and the size of your worker(s).
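
As a rough rule, whatever collect() returns has to fit in the driver's heap (sized
with --driver-memory at submit time), not on the workers. A gentler alternative,
sketched below, is toLocalIterator(), which only pulls one partition to the driver
at a time:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Driver heap is normally sized at launch, e.g.:
#   spark-submit --driver-memory 8g my_app.py
rdd = spark.sparkContext.textFile("/path/to/data")

# collect() materializes every record on the driver at once:
# records = rdd.collect()

# toLocalIterator() streams partition by partition, so the driver only
# needs enough memory for the largest single partition.
for record in rdd.toLocalIterator():
    pass  # process each record here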

Thank You,

Irving Duran


On Sat, Apr 28, 2018 at 10:41 AM klrmowse <klrmo...@gmail.com> wrote:

> i am currently trying to find a workaround for the Spark application i am
> working on so that it does not have to use .collect()
>
> but, for now, it is going to have to use .collect()
>
> what is the size limit (memory for the driver) of RDD file that .collect()
> can work with?
>
> i've been scouring google-search - S.O., blogs, etc, and everyone is
> cautioning about .collect(), but does not specify how huge is huge... are
> we
> talking about a few gigabytes? terabytes?? petabytes???
>
>
>
> thank you
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: ML Linear and Logistic Regression - Poor Performance

2018-04-27 Thread Irving Duran
Are you reformatting the data correctly for logistic (meaning 0 & 1's)
before modeling?  What OS and Spark version are you using?

Thank You,

Irving Duran


On Fri, Apr 27, 2018 at 2:34 PM Thodoris Zois <z...@ics.forth.gr> wrote:

> Hello,
>
> I am running an experiment to test logistic and linear regression on spark
> using MLlib.
>
> My dataset is only 128MB and something weird happens. Linear regression
> takes about 127 seconds either with 1 or 500 iterations. On the other hand,
> logistic regression most of the times does not manage to finish either with
> 1 iteration. I usually get memory heap error.
>
> In both cases I use the default cores and memory for driver and I spawn 1
> executor with 1 core and 2GBs of memory.
>
> Except that, I get a warning about NativeBLAS. I searched in the Internet
> and I found that I have to install libgfortran. Even if I did it the
> warning remains.
>
> Any ideas for the above?
>
> Thank you,
> - Thodoris
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Can spark handle this scenario?

2018-02-16 Thread Irving Duran
Do you only want to use Scala? Otherwise, I think that with pyspark and
pandas read_table you should be able to accomplish what you want.
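
For what it's worth, here is a minimal pyspark sketch of the parallel-download idea
(pull_symbol and the symbol list are hypothetical stand-ins for the real Yahoo call):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def pull_symbol(symbol, sector):
    # Stand-in for the real HTTP/pandas call that fetches one symbol's ticks;
    # return a list of (symbol, sector, open, close) tuples.
    return [(symbol, sector, 0.0, 0.0)]

symbols = [("AAPL", "tech"), ("XOM", "energy")]   # placeholder symbol list
symbol_rdd = spark.sparkContext.parallelize(symbols, 8)

ticks = symbol_rdd.flatMap(lambda s: pull_symbol(*s))
ticks_df = ticks.toDF(["symbol", "sector", "open", "close"])
ticks_df.show()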

Thank you,

Irving Duran

On 02/16/2018 06:10 PM, Lian Jiang wrote:
> Hi,
>
> I have a user case:
>
> I want to download S stock data from Yahoo API in parallel using
> Spark. I have got all stock symbols as a Dataset. Then I used below
> code to call Yahoo API for each symbol:
>
>        
>
> case class Symbol(symbol: String, sector: String)
>
> case class Tick(symbol: String, sector: String, open: Double, close:
> Double)
>
>
> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
>
>
>     symbolDs.map { k =>
>
>       pullSymbolFromYahoo(k.symbol, k.sector)
>
>     }
>
>
> This statement cannot compile:
>
>
> Unable to find encoder for type stored in a Dataset.  Primitive types
> (Int, String, etc) and Product types (case classes) are supported by
> importing spark.implicits._  Support for serializing other types will
> be added in future releases.
>
>
>
> My questions are:
>
>
> 1. As you can see, this scenario is not traditional dataset handling
> such as count, sql query... Instead, it is more like a UDF which apply
> random operation on each record. Is Spark good at handling such scenario?
>
>
> 2. Regarding the compilation error, any fix? I did not find a
> satisfactory solution online.
>
>
> Thanks for help!
>
>
>
>





Re: Do we always need to go through spark-submit?

2017-08-30 Thread Irving Duran
I don't know how well this would work, but maybe your jar could call
spark-submit from within itself if you compile it with the spark-submit
class on the classpath.


Thank You,

Irving Duran

On Wed, Aug 30, 2017 at 10:57 AM, kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> I understand spark-submit sets up its own class loader and other things
> but I am wondering if it  is possible to just compile the code and run it
> using "java -jar mysparkapp.jar" ?
>
> Thanks,
> kant
>


Re: [Spark] Can Apache Spark be used with time series processing?

2017-08-30 Thread Irving Duran
I think it will work.  You might want to explore Spark Streaming.
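
Here is a minimal Structured Streaming sketch of the "moved more than 30% within
five minutes" rule (the socket source and the line format are placeholders for
whatever actually carries the series, e.g. Kafka):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder source: lines like "2017-08-30T10:00:00,42.7" on a local socket.
raw = (spark.readStream.format("socket")
       .option("host", "localhost").option("port", 9999).load())

ticks = raw.select(
    F.split("value", ",")[0].cast("timestamp").alias("ts"),
    F.split("value", ",")[1].cast("double").alias("price"))

# Sliding 5-minute windows; flag any window whose high/low swing exceeds 30%.
alerts = (ticks
          .withWatermark("ts", "10 minutes")
          .groupBy(F.window("ts", "5 minutes", "1 minute"))
          .agg(F.max("price").alias("hi"), F.min("price").alias("lo"))
          .where((F.col("hi") - F.col("lo")) / F.col("lo") > 0.30))

query = alerts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()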


Thank You,

Irving Duran

On Wed, Aug 30, 2017 at 10:50 AM, <kanth...@gmail.com> wrote:

> I don't see why not
>
> Sent from my iPhone
>
> > On Aug 24, 2017, at 1:52 PM, Alexandr Porunov <
> alexandr.poru...@gmail.com> wrote:
> >
> > Hello,
> >
> > I am new in Apache Spark. I need to process different time series data
> (numeric values which depend on time) and react on next actions:
> > 1. Data is changing up or down too fast.
> > 2. Data is changing constantly up or down too long.
> >
> > For example, if the data have changed 30% up or down in the last five
> minutes (or less), then I need to send a special event.
> > If the data have changed 50% up or down in two hours (or less), then I
> need to send a special event.
> >
> > Frequency of data changing is about 1000-3000 per second. And I need to
> react as soon as possible.
> >
> > Does Apache Spark fit well for this scenario or I need to search for
> another solution?
> > Sorry for stupid question, but I am a total newbie.
> >
> > Regards
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Irving Duran
I think there is a difference between the actual value in the cell and how
Excel formats that cell.  You probably want to import that field as a
string, or not store it as a date format in Excel.

Just a thought
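
If the double you are getting back is the raw Excel serial day number (a guess, but
that is usually what it is), one workaround sketch is to convert it to a date after
the read using Excel's 1899-12-30 epoch:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the frame read via com.crealytics.spark.excel, where the date
# column was inferred as a double (the Excel serial day number).
df = spark.createDataFrame([(42962.0,)], ["date_col"])   # 42962 -> 2017-08-15

fixed = df.withColumn(
    "date_col",
    F.expr("date_add(to_date('1899-12-30'), cast(date_col as int))"))
fixed.show()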


Thank You,

Irving Duran

On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu <aakash.spark@gmail.com>
wrote:

> Hey all,
>
> Forgot to attach the link to the overriding Schema through external
> package's discussion.
>
> https://github.com/crealytics/spark-excel/pull/13
>
> You can see my comment there too.
>
> Thanks,
> Aakash.
>
> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu <aakash.spark@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
>> fetch data from an excel file using
>> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
>> double for a date type column.
>>
>> The detailed description is given here (the question I posted) -
>>
>> https://stackoverflow.com/questions/45713699/inferschema-
>> using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>
>>
>> Found it is a probable bug with the crealytics excel read package.
>>
>> Can somebody help me with a workaround for this?
>>
>> Thanks,
>> Aakash.
>>
>
>


Re: ALSModel.load not working on pyspark 2.1.0

2017-07-31 Thread Irving Duran
I think the problem is because you are calling "model2 =
ALSModel.load("/models/als")" instead of "model2 =
*model*.load("/models/als")".
See my working sample below.

>>> model.save('/models/als.test')
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.
>>> model2 = model.load('/models/als.test')
>>> model
ALS_4324a1082d889dd1f0e4
>>> model2
ALS_4324a1082d889dd1f0e4


Thank You,

Irving Duran

On Sat, Jul 29, 2017 at 2:57 PM, Cristian Garcia <cgarcia@gmail.com>
wrote:

> This code is not working:
>
> 
> from pyspark.ml.evaluation import RegressionEvaluator
> from pyspark.ml.recommendation import ALS, ALSModel
> from pyspark.sql import Row
>
> als = ALS(maxIter=10, regParam=0.01, userCol="user_id",
> itemCol="movie_id", ratingCol="rating")
> model = als.fit(training)
>
> model.save("/models/als")
>
> model2 = ALSModel.load("/models/als") # <-- error here
> =
>
>
>
> Gives rise to this error:
> =
>
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> in ()
> ----> 1 m2 = ALSModel.load("/models/als")
>
> /usr/local/spark/python/pyspark/ml/util.py in load(cls, path)
>     251     def load(cls, path):
>     252         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
> --> 253         return cls.read().load(path)
>     254
>     255
>
> /usr/local/spark/python/pyspark/ml/util.py in load(self, path)
>     192         if not isinstance(path, basestring):
>     193             raise TypeError("path should be a basestring, got type %s" % type(path))
> --> 194         java_obj = self._jread.load(path)
>     195         if not hasattr(self._clazz, "_from_java"):
>     196             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
>
> /usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
>    1131         answer = self.gateway_client.send_command(command)
>    1132         return_value = get_return_value(
> -> 1133             answer, self.gateway_client, self.target_id, self.name)
>    1134
>    1135         for temp_arg in temp_args:
>
> /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>      61     def deco(*a, **kw):
>      62         try:
> ---> 63             return f(*a, **kw)
>      64         except py4j.protocol.Py4JJavaError as e:
>      65             s = e.java_exception.toString()
>
> /usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     317                 raise Py4JJavaError(
>     318                     "An error occurred while calling {0}{1}{2}.\n".
> --> 319                     format(target_id, ".", name), value)
>     320             else:
>     321                 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o337.load.
> : java.lang.UnsupportedOperationException: empty collection
>   at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1370)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.first(RDD.scala:1367)
>   at 
> org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:379)
>   at 
> org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:317)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:748)
>
> =
>


Re: Scala, Python or Java for Spark programming

2017-06-13 Thread Irving Duran
>>>>>> Java is often underestimated, because people are not aware of its
>>>>>> lambda functionality which makes the code very readable. Scala - it 
>>>>>> depends
>>>>>> who programs it. People coming with the normal Java background write
>>>>>> Java-like code in scala which might not be so good. People from a
>>>>>> functional background write it more functional like - i.e. You have a lot
>>>>>> of things in one line of code which can be a curse even for other
>>>>>> functional programmers, especially if the application is distributed as 
>>>>>> in
>>>>>> the case of Spark. Usually no comment is provided and you have - even as 
>>>>>> a
>>>>>> functional programmer - to do a lot of drill down. Python is somehow
>>>>>> similar, but since it has no connection with Java you do not have these
>>>>>> extremes. There it depends more on the community (e.g. Medical, 
>>>>>> financials)
>>>>>> and skills of people how the code look likes.
>>>>>> However the difficulty comes with the distributed applications behind
>>>>>> Spark which may have unforeseen side effects if the users do not know 
>>>>>> this,
>>>>>> ie if they have never been used to parallel programming.
>>>>>>
>>>>>> On 7. Jun 2017, at 17:20, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am a fan of Scala and functional programming hence I prefer Scala.
>>>>>>
>>>>>> I had a discussion with a hardcore Java programmer and a data
>>>>>> scientist who prefers Python.
>>>>>>
>>>>>> Their view is that in a collaborative work using Scala programming it
>>>>>> is almost impossible to understand someone else's Scala code.
>>>>>>
>>>>>> Hence I was wondering how much truth is there in this statement.
>>>>>> Given that Spark uses Scala as its core development language, what is the
>>>>>> general view on the use of Scala, Python or Java?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn * 
>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>> --
Thank You,

Irving Duran


Re: Adding header to an rdd before saving to text file

2017-06-06 Thread Irving Duran
Not the best option, but I've done this before: if you know the column
structure, you could manually write the header to the file before exporting.
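
A minimal sketch of that manual approach with RDDs (coalescing to one partition
keeps the header on the first line, at the cost of a single output file):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Stand-in for your transformed RDD of comma-joined rows.
data = sc.parallelize(["1,foo", "2,bar"])

# Prepend the header as a one-element RDD; union keeps its partition first,
# and coalesce(1) writes everything to a single part file.
header = sc.parallelize(["header1,header2"])
header.union(data).coalesce(1).saveAsTextFile("newfile")

As the reply below points out, the DataFrame CSV writer with the header option is
the simpler route if you can work with DataFrames.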

On Tue, Jun 6, 2017 at 12:39 AM 颜发才(Yan Facai) <facai@gmail.com> wrote:

> Hi, upendra.
> It will be easier to use DataFrame to read/save csv file with header, if
> you'd like.
>
> On Tue, Jun 6, 2017 at 5:15 AM, upendra 1991 <
> upendra1...@yahoo.com.invalid> wrote:
>
>> I am reading a CSV(file has headers header 1st,header2) and generating
>> rdd,
>> After few transformations I create an rdd and finally write it to a txt
>> file.
>>
>> What's the best way to add the header from source file, into rdd and have
>> it available as header into new file I.e, when I transform the rdd into
>> textfile using saveAsTexFile("newfile") the header 1, header 2 shall be
>> available.
>>
>>
>> Thanks,
>> Upendra
>>
>
> --
Thank You,

Irving Duran


Re: Edge Node in Spark

2017-06-06 Thread Irving Duran
Where in the documentation did you find "edge node"? Spark would call it
worker or executor, but not "edge node".  Here is some info about YARN logs
-> https://spark.apache.org/docs/latest/running-on-yarn.html.


Thank You,

Irving Duran

On Tue, Jun 6, 2017 at 11:48 AM, Ashok Kumar <ashok34...@yahoo.com> wrote:

> Just Straight Spark please.
>
> Also if I run a spark job using Python or Scala using Yarn where the log
> files are kept in the edge node?  Are these under logs directory for yarn?
>
> thanks
>
>
> On Tuesday, 6 June 2017, 14:11, Irving Duran <irving.du...@gmail.com>
> wrote:
>
>
> Ashok,
> Are you working with straight spark or referring to GraphX?
>
>
> Thank You,
>
> Irving Duran
>
> On Mon, Jun 5, 2017 at 3:45 PM, Ashok Kumar <ashok34...@yahoo.com.invalid>
> wrote:
>
> Hi,
>
> I am a bit confused between Edge node, Edge server and gateway node in
> Spark.
>
> Do these mean the same thing?
>
> How does one set up an Edge node to be used in Spark? Is this different
> from Edge node for Hadoop please?
>
> Thanks
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
>


Re: Edge Node in Spark

2017-06-06 Thread Irving Duran
Ashok,
Are you working with straight spark or referring to GraphX?


Thank You,

Irving Duran

On Mon, Jun 5, 2017 at 3:45 PM, Ashok Kumar <ashok34...@yahoo.com.invalid>
wrote:

> Hi,
>
> I am a bit confused between Edge node, Edge server and gateway node in
> Spark.
>
> Do these mean the same thing?
>
> How does one set up an Edge node to be used in Spark? Is this different
> from Edge node for Hadoop please?
>
> Thanks
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Issue upgrading to Spark 2.1.1 from 2.1.0

2017-05-07 Thread Irving Duran
I haven't noticed that behavior with ALS.

Thank you,

Irving Duran

On 05/07/2017 04:14 PM, mhornbech wrote:
> Hi
>
> We have just tested the new Spark 2.1.1 release, and observe an issue where
> the driver program hangs when making predictions using a random forest. The
> issue disappears when downgrading to 2.1.0.
>
> Have anyone observed similar issues? Recommendations on how to dig into this
> would also be much appreciated. The driver program seemingly hangs (no
> messages in the log and no running spark jobs) with a constant 100% cpu
> usage.
>
> Morten
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Issue-upgrading-to-Spark-2-1-1-from-2-1-0-tp28660.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>






Re: Graph Analytics on HBase with HGraphDB and Spark GraphFrames

2017-04-02 Thread Irving Duran
Thanks for the share!


Thank You,

Irving Duran

On Sun, Apr 2, 2017 at 7:19 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Interesting!
>
> --
> *From:* Robert Yokota <rayok...@gmail.com>
> *Sent:* Sunday, April 2, 2017 9:40:07 AM
> *To:* user@spark.apache.org
> *Subject:* Graph Analytics on HBase with HGraphDB and Spark GraphFrames
>
> Hi,
>
> In case anyone is interested in analyzing graphs in HBase with Apache
> Spark GraphFrames, this might be helpful:
>
> https://yokota.blog/2017/04/02/graph-analytics-on-hbase-with
> -hgraphdb-and-spark-graphframes/
>


Re: [SparkSQL] pre-check syntex before running spark job?

2017-02-21 Thread Irving Duran
You can also run it in the REPL and test whether you are getting the expected
result.
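
A small sketch of that idea using explain(), which parses and plans the statement
without launching a job (the exception classes are from pyspark.sql.utils):

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException, ParseException

spark = SparkSession.builder.getOrCreate()

def is_valid_sql(statement):
    """Parse and plan the statement without running it."""
    try:
        spark.sql(statement).explain()   # planning only, no action triggered
        return True
    except (ParseException, AnalysisException):
        return False

print(is_valid_sql("SELECT 1 AS x"))   # True
print(is_valid_sql("SELEC 1 FROM"))    # False (syntax error)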


Thank You,

Irving Duran

On Tue, Feb 21, 2017 at 8:01 AM, Yong Zhang <java8...@hotmail.com> wrote:

> You can always use explain method to validate your DF or SQL, before any
> action.
>
>
> Yong
>
>
> --
> *From:* Jacek Laskowski <ja...@japila.pl>
> *Sent:* Tuesday, February 21, 2017 4:34 AM
> *To:* Linyuxin
> *Cc:* user
> *Subject:* Re: [SparkSQL] pre-check syntex before running spark job?
>
> Hi,
>
> Never heard about such a tool before. You could use Antlr to parse SQLs
> (just as Spark SQL does while parsing queries). I think it's a one-hour
> project.
>
> Jacek
>
> On 21 Feb 2017 4:44 a.m., "Linyuxin" <linyu...@huawei.com> wrote:
>
> Hi All,
> Is there any tool/api to check the sql syntax without running spark job
> actually?
>
> Like the siddhiQL on storm here:
> SiddhiManagerService. validateExecutionPlan
> https://github.com/wso2/siddhi/blob/master/modules/siddhi-
> core/src/main/java/org/wso2/siddhi/core/SiddhiManagerService.java
> it can validate the syntax before running the sql on storm
>
> this is very useful for exposing sql string as a DSL of the platform.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Re: Graphx Examples for ALS

2017-02-17 Thread Irving Duran
Not sure I follow your question.  Do you want to use ALS or GraphX?


Thank You,

Irving Duran

On Fri, Feb 17, 2017 at 7:07 AM, balaji9058 <kssb...@gmail.com> wrote:

> Hi,
>
> Where can I find the ALS recommendation algorithm for large data sets?
>
> Please feel to share your ideas/algorithms/logic to build recommendation
> engine by using spark graphx
>
> Thanks in advance.
>
> Thanks,
> Balaji
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Graphx-Examples-for-ALS-tp28401.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Is it better to Use Java or Python on Scala for Spark for using big data sets

2017-02-09 Thread Irving Duran
I would say Java, since it will be somewhat similar to Scala.  Now, this 
assumes that you have some app already written in Scala. If you don't, then 
pick the language that you feel most comfortable with.

Thank you,

Irving Duran

On Feb 9, 2017, at 11:59 PM, nancy henry <nancyhenry6...@gmail.com> wrote:

Hi All,

Is it better to Use Java or Python on Scala for Spark coding..

Mainly My work is with getting file data which is in csv format  and I have to 
do some rule checking and rule aggrgeation

and put the final filtered data back to oracle so that real time apps can use 
it..

Re: Spark: Scala Shell Very Slow (Unresponsive)

2017-02-06 Thread Irving Duran
I only experience this the first time I install a new Spark version.
After that, it flows smoothly.  Since you mention your server, I assume
that you are connecting remotely; do you experience the same latency when
invoking other remote commands?  If so, then it might be your connection
rather than Spark.


Thank You,

Irving Duran

On Thu, Feb 2, 2017 at 3:34 PM, jimitkr <ji...@softpath.net> wrote:

> Friends,
>
> After i launch spark-shell, the default Scala shell appears but is
> unresponsive.
>
> When i type any command on the shell, nothing appears on my screen; the
> shell is completely unresponsive.
>
> My server has 32 gigs of memory and approx 18 GB is empty after launching
> spark-shell, so it may not be a memory issue. Is there some JVM size i need
> to change somewhere?
>
> How do i get the scala shell to work as designed?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-Scala-Shell-Very-Slow-
> Unresponsive-tp28358.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Shortest path performance in Graphx with Spark

2017-01-11 Thread Irving Duran
Hi Gerard,
How are you starting Spark? Are you allocating enough RAM for processing? I
think the default is 512 MB.  Try doing the following and see if it helps
(based on the size of your dataset, you might not need all 8 GB).

$SPARK_HOME/bin/spark-shell \
  --master local[4] \
  --executor-memory 8G \
  --driver-memory 8G



Thank You,

Irving Duran

On Tue, Jan 10, 2017 at 12:20 PM, Gerard Casey <gerardhughca...@gmail.com>
wrote:

> Hello everyone,
>
> I am creating a graph from a `gz` compressed `json` file of `edge` and
> `vertices` type.
>
> I have put the files in a dropbox folder [here][1]
>
> I load and map these `json` records to create the `vertices` and `edge`
> types required by `graphx` like this:
>
> val vertices_raw = sqlContext.read.json("path/vertices.json.gz")
> val vertices = vertices_raw.rdd.map(row=> ((row.getAs[String]("toid").
> stripPrefix("osgb").toLong),row.getAs[Long]("index")))
> val verticesRDD: RDD[(VertexId, Long)] = vertices
> val edges_raw = sqlContext.read.json("path/edges.json.gz")
> val edgesRDD = edges_raw.rdd.map(row=>(Edge(row.getAs[String]("positiveNode").stripPrefix("osgb").toLong,
> row.getAs[String]("negativeNode").stripPrefix("osgb").toLong,
> row.getAs[Double]("length"))))
> val my_graph: Graph[(Long),Double] = Graph.apply(verticesRDD,
> edgesRDD).partitionBy(PartitionStrategy.RandomVertexCut)
>
> I then use this `dijkstra` implementation I found to compute a shortest
> path between two vertices:
>
> def dijkstra[VD](g: Graph[VD, Double], origin: VertexId) = {
>   var g2 = g.mapVertices(
> (vid, vd) => (false, if (vid == origin) 0 else
> Double.MaxValue, List[VertexId]())
>   )
>   for (i <- 1L to g.vertices.count - 1) {
> val currentVertexId: VertexId =
> g2.vertices.filter(!_._2._1)
>   .fold((0L, (false, Double.MaxValue, List[VertexId]())))(
> (a, b) => if (a._2._2 < b._2._2) a else b)
>   ._1
>
> val newDistances: VertexRDD[(Double, List[VertexId])] =
>   g2.aggregateMessages[(Double, List[VertexId])](
> ctx => if (ctx.srcId == currentVertexId) {
>   ctx.sendToDst((ctx.srcAttr._2 + ctx.attr, ctx.srcAttr._3
> :+ ctx.srcId))
> },
> (a, b) => if (a._1 < b._1) a else b
>   )
> g2 = g2.outerJoinVertices(newDistances)((vid, vd, newSum) => {
>   val newSumVal = newSum.getOrElse((Double.MaxValue,
> List[VertexId]()))
>   (
> vd._1 || vid == currentVertexId,
> math.min(vd._2, newSumVal._1),
> if (vd._2 < newSumVal._1) vd._3 else newSumVal._2
> )
> })
> }
>
>   g.outerJoinVertices(g2.vertices)((vid, vd, dist) =>
> (vd, dist.getOrElse((false, Double.MaxValue, List[VertexId]()))
>   .productIterator.toList.tail
>   ))
> }
>
> I take two random vertex id's:
>
> val v1 = 400028222916L
> val v2 = 400031019012L
>
> and compute the path between them:
>
> val results = dijkstra(my_graph, v1).vertices.map(_._2).collect
>
> I am unable to compute this locally on my laptop without getting a
> stackoverflow error. I have 8GB RAM and 2.6 GHz Intel Core i5 processor. I
> can see that it is using 3 out of 4 cores available. I can load this graph
> and compute shortest on average around 10 paths per second with the
> `igraph` library in Python on exactly the same graph. Is this an
> inefficient means of computing paths? At scale, on multiple nodes the paths
> will compute (no stackoverflow error) but it is still 30/40seconds per path
> computation. I must be missing something.
>
> Thanks
>
>   [1]: https://www.dropbox.com/sh/9ug5ikr6j357q7j/AACDBR9UdM0g_
> ck_ykB8KXPXa?dl=0
>


Re: parsing embedded json in spark

2016-12-22 Thread Irving Duran
Is it an option to parse that field prior to creating the dataframe? If so,
that's what I would do.
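
Another option that keeps the parsing on the executors (a sketch assuming Spark 2.1+
and that you know the structure of the embedded document) is from_json:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Stand-in frame with an embedded JSON string column.
df = spark.createDataFrame(
    [(1, '{"name": "a", "count": 3}')], ["id", "payload"])

# Hypothetical schema for the embedded document -- adjust to the real fields.
payload_schema = StructType([
    StructField("name", StringType()),
    StructField("count", IntegerType()),
])

parsed = df.withColumn("payload", F.from_json("payload", payload_schema))
parsed.select("id", "payload.name", "payload.count").show()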

In terms of only the master node doing work, you will have to share more about
your setup: are you using Spark standalone, YARN, or Mesos?


Thank You,

Irving Duran

On Thu, Dec 22, 2016 at 1:42 AM, Tal Grynbaum <tal.grynb...@gmail.com>
wrote:

> Hi,
>
> I have a dataframe that contain an embedded json string in one of the
> fields
> I'd tried to write a UDF function that will parse it using lift-json, but
> it seems to take a very long time to process, and it seems that only the
> master node is working.
>
> Has anyone dealt with such a scenario before and can give me some hints?
>
> Thanks
> Tal
>


Re: Spark Batch checkpoint

2016-12-15 Thread Irving Duran
Not sure what programming language you are using, but in python you can do "
sc.setCheckpointDir('~/apps/spark-2.0.1-bin-hadoop2.7/checkpoint/')".  This
will store checkpoints on that directory that I called checkpoint.
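
Note that setting the directory alone does not persist anything; you also call
checkpoint() on the RDD you want to cut the lineage at. A minimal sketch (paths are
placeholders, and this truncates recomputation on retries rather than resuming a
killed application part-way):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   # any path all executors can reach

rdd = sc.textFile("/path/to/huge/input").map(lambda line: line.upper())
rdd.checkpoint()   # materialized to the checkpoint dir on the first action
rdd.count()

# Later stages (and task retries) read the checkpointed data instead of
# recomputing the full lineage from the original input.
result = rdd.filter(lambda line: "ERROR" in line).count()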


Thank You,

Irving Duran

On Thu, Dec 15, 2016 at 10:33 AM, Selvam Raman <sel...@gmail.com> wrote:

> Hi,
>
> is there any provision in spark batch for checkpoint.
>
> I am having huge data, it takes more than 3 hours to process all data. I
> am currently having 100 partitions.
>
> if the job fails after two hours, lets say it has processed 70 partition.
> should i start spark job from the beginning or is there way for checkpoint
> provision.
>
> Checkpoint,what i am expecting is start from 71 partition to till end.
>
> Please give me your suggestions.
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: [Spark log4j] Turning off log4j while scala program runs on spark-submit

2016-12-12 Thread Irving Duran
Hi -
I have a question about log4j while running on spark-submit.

I would like to have spark only show errors when I am running
spark-submit.  I would like to accomplish this without having to edit the log4j
config file in $SPARK_HOME; is there a way to do this?

I found this and it only works on spark-shell (not spark-submit) ->
http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console

Thanks for your help in advance.
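
In the meantime, one workaround sketch that avoids touching anything under
$SPARK_HOME is to set the level programmatically once the context exists (messages
printed by the launcher before this line still show up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Drops Spark's INFO/WARN chatter for everything after this point.
spark.sparkContext.setLogLevel("ERROR")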

Thank You,

Irving Duran


[Spark log4j] Turning off log4j while scala program runs on spark-submit

2016-12-09 Thread Irving Duran
Hi -
I have a question about log4j while running on spark-submit.

I would like to have spark only show errors when I am running
spark-submit.  I would like to accomplish this without having to edit the log4j
config file in $SPARK_HOME; is there a way to do this?

I found this and it only works on spark-shell (not spark-submit) ->
http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console

Thanks for your help in advance.

Thank You,

Irving Duran