Re: Do we need to kill a spark job every time we change and deploy it?
Are you referring to having Spark pick up a new jar build? If so, you can probably script that in bash. Thank You, Irving Duran On Wed, Nov 28, 2018 at 12:44 PM Mina Aslani wrote: > Hi, > > I have a question for you. > Do we need to kill a spark job every time we change and deploy it to > cluster? Or, is there a way for Spark to automatically pick up the recent > jar version? > > Best regards, > Mina >
Re: spark-shell doesn't start
You are trying to run "spark-shell" as a command, but it is not on your PATH. You might want to do "./spark-shell" or try "sudo ln -s /path/to/spark-shell /usr/bin/spark-shell" and then run "spark-shell". Thank You, Irving Duran On Sun, Jun 17, 2018 at 6:53 AM Raymond Xie wrote: > Hello, I am doing the practice in Ubuntu now, here is the error I am > encountering: > > > rxie@ubuntu:~/Downloads/spark/bin$ spark-shell > Error: Could not find or load main class org.apache.spark.launcher.Main > > > What am I missing? > > Thank you very much. > > Java is installed. > > ** > *Sincerely yours,* > > > *Raymond* >
Re: [Spark] Supporting python 3.5?
Cool, thanks for the validation! Thank You, Irving Duran On Thu, May 24, 2018 at 8:20 PM Jeff Zhang wrote: > > It supports python 3.5, and IIRC, spark also support python 3.6 > > Irving Duran 于2018年5月10日周四 下午9:08写道: > >> Does spark now support python 3.5 or it is just 3.4.x? >> >> https://spark.apache.org/docs/latest/rdd-programming-guide.html >> >> Thank You, >> >> Irving Duran >> >
Re: [announce] BeakerX supports Scala+Spark in Jupyter
So would you recommend not to have Toree and BeakerX installed to avoid conflicts? Thank you, Irving Duran On 06/07/2018 07:55 PM, s...@draves.org wrote: > The %%spark magic comes with BeakerX's Scala kernel, not related to Toree. > > On Thu, Jun 7, 2018, 8:51 PM Stephen Boesch <java...@gmail.com> wrote: > > Assuming that the spark 2.X kernel (e.g. toree) were chosen for a > given jupyter notebook and there is a Cell 3 that contains some > Spark DataFrame operations .. Then : > > * what is the relationship between the %%spark magic and the > toree kernel? > * how does the %%spark magic get applied to that other Cell 3 ? > > thanks! > > 2018-06-07 16:33 GMT-07:00 s...@draves.org: > > We are pleased to announce release 0.19.0 of BeakerX > <http://BeakerX.com>, a collection of extensions and kernels > for Jupyter and Jupyter Lab. > > BeakerX now features Scala+Spark integration including GUI > configuration, status, progress, interrupt, and interactive > tables. > > We are very interested in your feedback about what remains to > be done. You may reach us by GitHub and Gitter, as documented in > the readme: https://github.com/twosigma/beakerx > > Thanks, -Scott > > -- > BeakerX.com <http://BeakerX.com> > ScottDraves.com <http://www.ScottDraves.com> > @Scott_Draves <http://twitter.com/scott_draves>
Re: If there is timestamp type data in DF, Spark 2.3 toPandas is much slower than spark 2.2.
I haven't noticed or seen this behavior. Have you confirmed this by testing the same dataset between versions? Thank you, Irving Duran On 06/06/2018 11:22 PM, 李斌松 wrote: > If there is timestamp type data in DF, Spark 2.3 toPandas is much > slower than spark 2.2.
Re: Apache Spark Installation error
You probably need "spark-shell" to be recognized as a command in your environment. Maybe try "sudo ln -s /path/to/spark-shell /usr/bin/spark-shell". Have you tried "./spark-shell" in the current directory to see if it works? Thank You, Irving Duran On Thu, May 31, 2018 at 9:00 AM Remil Mohanan wrote: > Hi there, > >I am not able to execute the spark-shell command. Can you please help. > > Thanks > > Remil > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: GroupBy in Spark / Scala without Agg functions
Unless you want to get a count, yes. Thank You, Irving Duran On Tue, May 29, 2018 at 1:44 PM Chetan Khatri wrote: > Georg, I just want to double check that someone wrote MSSQL Server script > where it's groupby all columns. What is alternate best way to do distinct > all columns ? > > > > On Wed, May 30, 2018 at 12:08 AM, Georg Heiler > wrote: > >> Why do you group if you do not want to aggregate? >> Isn't this the same as select distinct? >> >> Chetan Khatri schrieb am Di., 29. Mai 2018 >> um 20:21 Uhr: >> >>> All, >>> >>> I have scenario like this in MSSQL Server SQL where i need to do groupBy >>> without Agg function: >>> >>> Pseudocode: >>> >>> >>> select m.student_id, m.student_name, m.student_std, m.student_group, m.student_dob >>> from student as m inner join general_register g on m.student_id = g.student_id >>> group by m.student_id, m.student_name, m.student_std, m.student_group, >>> m.student_dob >>> >>> I tried to doing in spark but i am not able to get Dataframe as return >>> value, how this kind of things could be done in Spark. >>> >>> Thanks >>> >> >
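For what it's worth, a GROUP BY over every selected column with no aggregate yields exactly the distinct rows, which is why in Spark this is normally written as df.dropDuplicates() or df.distinct(). A plain-Python sketch of the equivalence (the rows are invented for illustration; no cluster needed):

```python
from itertools import groupby

# Duplicate rows, such as the join in the question can produce.
rows = [
    (1, "Ann", "10-A", "2000-01-01"),
    (1, "Ann", "10-A", "2000-01-01"),
    (2, "Bob", "10-B", "2001-02-02"),
]

# GROUP BY all columns with no aggregate: one output row per group key.
grouped = [key for key, _ in groupby(sorted(rows))]

# DISTINCT over the same rows.
distinct = sorted(set(rows))

assert grouped == distinct  # the two formulations yield the same rows
print(grouped)
```

In Spark itself the analogue would be something like joining the two DataFrames, selecting the columns from the query, and calling .distinct() on the result.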
[Spark] Supporting python 3.5?
Does Spark now support Python 3.5, or is it just 3.4.x? https://spark.apache.org/docs/latest/rdd-programming-guide.html Thank You, Irving Duran
Re: [pyspark] Read multiple files parallely into a single dataframe
I could be wrong, but I think you can do a wild card. df = spark.read.format('csv').load('/path/to/file*.csv.gz') Thank You, Irving Duran On Fri, May 4, 2018 at 4:38 AM Shuporno Choudhury < shuporno.choudh...@gmail.com> wrote: > Hi, > > I want to read multiple files parallely into 1 dataframe. But the files > have random names and cannot confirm to any pattern (so I can't use > wildcard). Also, the files can be in different directories. > If I provide the file names in a list to the dataframe reader, it reads > then sequentially. > Eg: > df=spark.read.format('csv').load(['/path/to/file1.csv.gz','/path/to/file2.csv.gz','/path/to/file3.csv.gz']) > This reads the files sequentially. What can I do to read the files > parallely? > I noticed that spark reads files parallely if provided directly the > directory location. How can that be extended to multiple random files? > Suppose if my system has 4 cores, how can I make spark read 4 files at a > time? > > Please suggest. >
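If the names really are arbitrary, one workaround is to build the path list on the driver with Python's glob/os utilities and hand that list to the reader. A sketch of just the listing step (file names here are invented for illustration):

```python
import glob
import os
import tempfile

# Create a few dummy .csv.gz files with unrelated names in one directory.
d = tempfile.mkdtemp()
for name in ["q1_sales.csv.gz", "backup_7.csv.gz", "readme.txt"]:
    open(os.path.join(d, name), "w").close()

# Collect every .csv.gz regardless of its base name.
paths = sorted(glob.glob(os.path.join(d, "*.csv.gz")))
print([os.path.basename(p) for p in paths])
# The resulting list can then be handed to the DataFrame reader, e.g.
# df = spark.read.format('csv').load(paths)
```

Whether the files are then read in parallel depends on how many input files and cores are available, but building the list up front removes the need for a common naming pattern.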
Re: ML Linear and Logistic Regression - Poor Performance
You may want to think about reducing the number of iterations; right now you have it set at 500. Thank You, Irving Duran On Fri, Apr 27, 2018 at 7:15 PM Thodoris Zois <z...@ics.forth.gr> wrote: > I am in CentOS 7 and I use Spark 2.3.0. Below I have posted my code. > Logistic regression took 85 minutes and linear regression 127 seconds… > > My dataset as I said is 128 MB and contains: 1000 features and ~100 > classes. > > > #SparkSession > ss = SparkSession.builder.getOrCreate() > > > start = time.time() > > #Read data > trainData = ss.read.format("csv").option("inferSchema","true").load(file) > > #Calculate Features > assembler = VectorAssembler(inputCols=trainData.columns[1:], outputCol= > "features") > trainData = assembler.transform(trainData) > > #Drop columns > dropColumns = trainData.columns > dropColumns = [e for e in dropColumns if e not in ('_c0', 'features')] > trainData = trainData.drop(*dropColumns) > > #Rename column from _c0 to label > trainData = trainData.withColumnRenamed("_c0", "label") > > #Logistic regression > lr = LogisticRegression(maxIter=500, regParam=0.3, elasticNetParam=0.8) > lrModel = lr.fit(trainData) > > #Output Coefficients > print("Coefficients: " + str(lrModel.coefficientMatrix)) > > > > - Thodoris > > > On 27 Apr 2018, at 22:50, Irving Duran <irving.du...@gmail.com> wrote: > > Are you reformatting the data correctly for logistic (meaning 0 & 1's) > before modeling? What are OS and spark version you using? > > Thank You, > > Irving Duran > > > On Fri, Apr 27, 2018 at 2:34 PM Thodoris Zois <z...@ics.forth.gr> wrote: > >> Hello, >> >> I am running an experiment to test logistic and linear regression on >> spark using MLlib. >> >> My dataset is only 128MB and something weird happens. Linear regression >> takes about 127 seconds either with 1 or 500 iterations. On the other hand, >> logistic regression most of the times does not manage to finish either with >> 1 iteration. I usually get memory heap error.
>> >> In both cases I use the default cores and memory for driver and I spawn 1 >> executor with 1 core and 2GBs of memory. >> >> Except that, I get a warning about NativeBLAS. I searched in the Internet >> and I found that I have to install libgfortran. Even if I did it the >> warning remains. >> >> Any ideas for the above? >> >> Thank you, >> - Thodoris >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> >
Re: [Spark 2.x Core] .collect() size limit
I don't think there is a magic number, so I would say that it will depend on how big your dataset is and how much memory your driver has, since .collect() pulls everything back to the driver. Thank You, Irving Duran On Sat, Apr 28, 2018 at 10:41 AM klrmowse <klrmo...@gmail.com> wrote: > i am currently trying to find a workaround for the Spark application i am > working on so that it does not have to use .collect() > > but, for now, it is going to have to use .collect() > > what is the size limit (memory for the driver) of RDD file that .collect() > can work with? > > i've been scouring google-search - S.O., blogs, etc, and everyone is > cautioning about .collect(), but does not specify how huge is huge... are > we > talking about a few gigabytes? terabytes?? petabytes??? > > > > thank you > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
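One way to sanity-check a planned collect() is to estimate the dataset's in-memory size from a small sample and compare it against the driver's heap. A rough plain-Python sketch of that arithmetic (the sample rows, total row count, and driver-memory figure are all made-up examples; on a real cluster the sample would come from something like rdd.take(1000)):

```python
import sys

# Stand-in for a sample of rows taken from the RDD.
sample = [("id-%d" % i, i * 1.5) for i in range(1000)]
total_rows = 50_000_000            # assumed full dataset row count
driver_memory_bytes = 4 * 1024**3  # assumed 4 GB driver heap

# getsizeof is shallow (it ignores the objects inside each tuple),
# so the real per-row cost is higher; treat this as a lower bound.
avg_row_bytes = sum(sys.getsizeof(r) for r in sample) / len(sample)
estimated_bytes = avg_row_bytes * total_rows

# Leave generous headroom: collect() also pays serialization overhead.
fits = estimated_bytes < driver_memory_bytes * 0.5
print(int(estimated_bytes), fits)
```

The same back-of-the-envelope check explains why "how huge is huge" has no fixed answer: the limit scales with whatever heap you give the driver (and, in Spark, the spark.driver.maxResultSize setting).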
Re: ML Linear and Logistic Regression - Poor Performance
Are you reformatting the data correctly for logistic (meaning 0 & 1's) before modeling? What OS and Spark version are you using? Thank You, Irving Duran On Fri, Apr 27, 2018 at 2:34 PM Thodoris Zois <z...@ics.forth.gr> wrote: > Hello, > > I am running an experiment to test logistic and linear regression on spark > using MLlib. > > My dataset is only 128MB and something weird happens. Linear regression > takes about 127 seconds either with 1 or 500 iterations. On the other hand, > logistic regression most of the times does not manage to finish either with > 1 iteration. I usually get memory heap error. > > In both cases I use the default cores and memory for driver and I spawn 1 > executor with 1 core and 2GBs of memory. > > Except that, I get a warning about NativeBLAS. I searched in the Internet > and I found that I have to install libgfortran. Even if I did it the > warning remains. > > Any ideas for the above? > > Thank you, > - Thodoris > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
Re: Can spark handle this scenario?
Do you only want to use Scala? Otherwise, I think with PySpark and pandas read_table you should be able to accomplish what you want. Thank you, Irving Duran On 02/16/2018 06:10 PM, Lian Jiang wrote: > Hi, > > I have a user case: > > I want to download S stock data from Yahoo API in parallel using > Spark. I have got all stock symbols as a Dataset. Then I used below > code to call Yahoo API for each symbol: > > > > case class Symbol(symbol: String, sector: String) > > case class Tick(symbol: String, sector: String, open: Double, close: > Double) > > > // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick] > > > symbolDs.map { k => > > pullSymbolFromYahoo(k.symbol, k.sector) > > } > > > This statement cannot compile: > > > Unable to find encoder for type stored in a Dataset. Primitive types > (Int, String, etc) and Product types (case classes) are supported by > importing spark.implicits._ Support for serializing other types will > be added in future releases. > > > > My questions are: > > > 1. As you can see, this scenario is not traditional dataset handling > such as count, sql query... Instead, it is more like a UDF which apply > random operation on each record. Is Spark good at handling such scenario? > > > 2. Regarding the compilation error, any fix? I did not find a > satisfactory solution online. > > > Thanks for help! > >
Re: Do we always need to go through spark-submit?
I don't know how well this would work, but maybe your jar could call spark-submit from within itself if you compile it with the spark-submit class. Thank You, Irving Duran On Wed, Aug 30, 2017 at 10:57 AM, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > I understand spark-submit sets up its own class loader and other things > but I am wondering if it is possible to just compile the code and run it > using "java -jar mysparkapp.jar" ? > > Thanks, > kant >
Re: [Spark] Can Apache Spark be used with time series processing?
I think it will work. You might want to explore Spark Streaming. Thank You, Irving Duran On Wed, Aug 30, 2017 at 10:50 AM, <kanth...@gmail.com> wrote: > I don't see why not > > Sent from my iPhone > > > On Aug 24, 2017, at 1:52 PM, Alexandr Porunov < > alexandr.poru...@gmail.com> wrote: > > > > Hello, > > > > I am new in Apache Spark. I need to process different time series data > (numeric values which depend on time) and react on next actions: > > 1. Data is changing up or down too fast. > > 2. Data is changing constantly up or down too long. > > > > For example, if the data have changed 30% up or down in the last five > minutes (or less), then I need to send a special event. > > If the data have changed 50% up or down in two hours (or less), then I > need to send a special event. > > > > Frequency of data changing is about 1000-3000 per second. And I need to > react as soon as possible. > > > > Does Apache Spark fit well for this scenario or I need to search for > another solution? > > Sorry for stupid question, but I am a total newbie. > > > > Regards > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
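As a concrete sketch of the detection logic itself, independent of Spark: keep a sliding window of recent values and fire an event when the newest value differs from the oldest by more than a threshold. The 30% threshold follows the example in the question; the window length and data are invented, and the values are assumed positive so the percent change is well defined. In Spark this logic would map onto a streaming window rather than a local deque.

```python
from collections import deque

def change_exceeds(window, threshold):
    """True if the newest value in the window differs from the
    oldest by at least `threshold` (as a fraction, e.g. 0.30)."""
    old, new = window[0], window[-1]
    return abs(new - old) / old >= threshold

# A 5-sample sliding window standing in for "the last five minutes".
window = deque(maxlen=5)
events = []
for value in [100, 102, 101, 99, 135, 140]:
    window.append(value)
    if len(window) == window.maxlen and change_exceeds(window, 0.30):
        events.append(value)  # would emit a "special event" here

print(events)
```

Running this flags 135 (a 35% move against 100) and 140 (about 37% against 102), which is the behavior the question describes.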
Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type
I think there is a difference between the actual value in the cell and what Excel formats that cell. You probably want to import that field as a string or not have it as a date format in Excel. Just a thought. Thank You, Irving Duran On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu <aakash.spark@gmail.com> wrote: > Hey all, > > Forgot to attach the link to the overriding Schema through external > package's discussion. > > https://github.com/crealytics/spark-excel/pull/13 > > You can see my comment there too. > > Thanks, > Aakash. > > On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu <aakash.spark@gmail.com> > wrote: > >> Hi all, >> >> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to >> fetch data from an excel file using >> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring >> double for a date type column. >> >> The detailed description is given here (the question I posted) - >> >> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d >> >> >> Found it is a probable bug with the crealytics excel read package. >> >> Can somebody help me with a workaround for this? >> >> Thanks, >> Aakash. >> > >
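Background on why a date column can surface as a double: Excel stores dates as serial day numbers (days since 1899-12-30 on the default Windows date system, with the time of day as the fractional part). So one workaround, if the package hands you the raw double, is to convert it yourself after reading. A pure-Python sketch of that conversion (no Spark needed to illustrate; in PySpark the same arithmetic could live in a UDF):

```python
from datetime import datetime, timedelta

# Default Windows "1900 date system" epoch. Note Excel's historical
# leap-year quirk makes serials below 61 off by a day; modern dates are fine.
EXCEL_EPOCH = datetime(1899, 12, 30)

def excel_serial_to_datetime(serial):
    """Convert an Excel serial day number (possibly fractional,
    where the fraction is the time of day) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)

print(excel_serial_to_datetime(2))        # 1900-01-01 00:00:00
print(excel_serial_to_datetime(43100.5))  # 2017-12-31 12:00:00
```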
Re: ALSModel.load not working on pyspark 2.1.0
I think the problem is because you are calling "model2 = ALSModel.load("/models/als")" instead of "model2 = *model*.load("/models/als")". See my working sample below. >>> model.save('/models/als.test') SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. >>> model2 = model.load('/models/als.test') >>> model ALS_4324a1082d889dd1f0e4 >>> model2 ALS_4324a1082d889dd1f0e4 Thank You, Irving Duran On Sat, Jul 29, 2017 at 2:57 PM, Cristian Garcia <cgarcia@gmail.com> wrote: > This code is not working: > > > from pyspark.ml.evaluation import RegressionEvaluator > from pyspark.ml.recommendation import ALS, ALSModel > from pyspark.sql import Row > > als = ALS(maxIter=10, regParam=0.01, userCol="user_id", > itemCol="movie_id", ratingCol="rating") > model = als.fit(training) > > model.save("/models/als") > > model2 = ALSModel.load("/models/als") # <-- error here > = > > > > Gives rise to this error: > = > > ---Py4JJavaError > Traceback (most recent call > last) in ()> 1 m2 = > ALSModel.load("/models/als") > /usr/local/spark/python/pyspark/ml/util.py in load(cls, path)251 def > load(cls, path):252 """Reads an ML instance from the input path, > a shortcut of `read().load(path)`."""--> 253 return > cls.read().load(path)254 255 > /usr/local/spark/python/pyspark/ml/util.py in load(self, path)192 > if not isinstance(path, basestring):193 raise TypeError("path > should be a basestring, got type %s" % type(path))--> 194 java_obj = > self._jread.load(path)195 if not hasattr(self._clazz, > "_from_java"):196 raise NotImplementedError("This Java ML > type cannot be loaded into Python currently: %r" > /usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in > __call__(self, *args) 1131 answer = > self.gateway_client.send_command(command) 1132 return_value = > get_return_value(-> 1133 answer, self.gateway_client, > 
self.target_id, self.name) 11341135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def > deco(*a, **kw): 62 try:---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: 65 s = > e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name)317 >raise Py4JJavaError(318 "An error occurred while > calling {0}{1}{2}.\n".--> 319 format(target_id, ".", > name), value)320 else:321 raise Py4JError( > Py4JJavaError: An error occurred while calling o337.load. > : java.lang.UnsupportedOperationException: empty collection > at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1370) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.first(RDD.scala:1367) > at > org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:379) > at > org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:317) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:748) > > = >
Re: Scala, Python or Java for Spark programming
>>>>>> Java is often underestimated, because people are not aware of its >>>>>> lambda functionality which makes the code very readable. Scala - it >>>>>> depends >>>>>> who programs it. People coming with the normal Java background write >>>>>> Java-like code in scala which might not be so good. People from a >>>>>> functional background write it more functional like - i.e. You have a lot >>>>>> of things in one line of code which can be a curse even for other >>>>>> functional programmers, especially if the application is distributed as >>>>>> in >>>>>> the case of Spark. Usually no comment is provided and you have - even as >>>>>> a >>>>>> functional programmer - to do a lot of drill down. Python is somehow >>>>>> similar, but since it has no connection with Java you do not have these >>>>>> extremes. There it depends more on the community (e.g. Medical, >>>>>> financials) >>>>>> and skills of people how the code look likes. >>>>>> However the difficulty comes with the distributed applications behind >>>>>> Spark which may have unforeseen side effects if the users do not know >>>>>> this, >>>>>> ie if they have never been used to parallel programming. >>>>>> >>>>>> On 7. Jun 2017, at 17:20, Mich Talebzadeh <mich.talebza...@gmail.com> >>>>>> wrote: >>>>>> >>>>>> >>>>>> Hi, >>>>>> >>>>>> I am a fan of Scala and functional programming hence I prefer Scala. >>>>>> >>>>>> I had a discussion with a hardcore Java programmer and a data >>>>>> scientist who prefers Python. >>>>>> >>>>>> Their view is that in a collaborative work using Scala programming it >>>>>> is almost impossible to understand someone else's Scala code. >>>>>> >>>>>> Hence I was wondering how much truth is there in this statement. >>>>>> Given that Spark uses Scala as its core development language, what is the >>>>>> general view on the use of Scala, Python or Java?
>>>>>> >>>>>> Thanks, >>>>>> >>>>>> Dr Mich Talebzadeh >>>>>> >>>>>> >>>>>> >>>>>> LinkedIn * >>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>>>>> >>>>>> >>>>>> >>>>>> http://talebzadehmich.wordpress.com >>>>>> >>>>>> >>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility >>>>>> for any loss, damage or destruction of data or any other property which >>>>>> may >>>>>> arise from relying on this email's technical content is explicitly >>>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>>> arising from such loss, damage or destruction. >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> -- Thank You, Irving Duran
Re: Adding header to an rdd before saving to text file
Not the best option, but I've done this before. If you know the column structure, you could manually write the headers to the file before exporting. On Tue, Jun 6, 2017 at 12:39 AM 颜发才(Yan Facai) <facai@gmail.com> wrote: > Hi, upendra. > It will be easier to use DataFrame to read/save csv file with header, if > you'd like. > > On Tue, Jun 6, 2017 at 5:15 AM, upendra 1991 < > upendra1...@yahoo.com.invalid> wrote: > >> I am reading a CSV(file has headers header 1st,header2) and generating >> rdd, >> After few transformations I create an rdd and finally write it to a txt >> file. >> >> What's the best way to add the header from source file, into rdd and have >> it available as header into new file I.e, when I transform the rdd into >> textfile using saveAsTexFile("newfile") the header 1, header 2 shall be >> available. >> >> >> Thanks, >> Upendra >> > > -- Thank You, Irving Duran
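A sketch of the manual approach for a single output file, in plain Python (the header and rows are invented examples; on a real cluster saveAsTextFile writes part-files, so this would apply to a merged/downloaded result, or you would write a header file and concatenate it in front of the parts):

```python
import os
import tempfile

header = "header1,header2"
rows = ["a,1", "b,2"]  # stand-in for the transformed RDD's output lines

out = os.path.join(tempfile.mkdtemp(), "newfile.txt")
with open(out, "w") as f:
    f.write(header + "\n")           # write the header first...
    f.write("\n".join(rows) + "\n")  # ...then the data lines

print(open(out).read())
```

As Yan notes in the thread, letting the DataFrame CSV writer handle the header for you is the simpler route when it fits.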
Re: Edge Node in Spark
Where in the documentation did you find "edge node"? Spark would call it worker or executor, but not "edge node". Here is some info about yarn logs -> https://spark.apache.org/docs/latest/running-on-yarn.html. Thank You, Irving Duran On Tue, Jun 6, 2017 at 11:48 AM, Ashok Kumar <ashok34...@yahoo.com> wrote: > Just Straight Spark please. > > Also if I run a spark job using Python or Scala using Yarn where the log > files are kept in the edge node? Are these under logs directory for yarn? > > thanks > > > On Tuesday, 6 June 2017, 14:11, Irving Duran <irving.du...@gmail.com> > wrote: > > > Ashok, > Are you working with straight spark or referring to GraphX? > > > Thank You, > > Irving Duran > > On Mon, Jun 5, 2017 at 3:45 PM, Ashok Kumar <ashok34...@yahoo.com.invalid> > wrote: > > Hi, > > I am a bit confused between Edge node, Edge server and gateway node in > Spark. > > Do these mean the same thing? > > How does one set up an Edge node to be used in Spark? Is this different > from Edge node for Hadoop please? > > Thanks > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > > > >
Re: Edge Node in Spark
Ashok, Are you working with straight spark or referring to GraphX? Thank You, Irving Duran On Mon, Jun 5, 2017 at 3:45 PM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: > Hi, > > I am a bit confused between Edge node, Edge server and gateway node in > Spark. > > Do these mean the same thing? > > How does one set up an Edge node to be used in Spark? Is this different > from Edge node for Hadoop please? > > Thanks > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
Re: Issue upgrading to Spark 2.1.1 from 2.1.0
I haven't noticed that behavior with ALS. Thank you, Irving Duran On 05/07/2017 04:14 PM, mhornbech wrote: > Hi > > We have just tested the new Spark 2.1.1 release, and observe an issue where > the driver program hangs when making predictions using a random forest. The > issue disappears when downgrading to 2.1.0. > > Have anyone observed similar issues? Recommendations on how to dig into this > would also be much appreciated. The driver program seemingly hangs (no > messages in the log and no running spark jobs) with a constant 100% cpu > usage. > > Morten > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Issue-upgrading-to-Spark-2-1-1-from-2-1-0-tp28660.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org >
Re: Graph Analytics on HBase with HGraphDB and Spark GraphFrames
Thanks for the share! Thank You, Irving Duran On Sun, Apr 2, 2017 at 7:19 PM, Felix Cheung <felixcheun...@hotmail.com> wrote: > Interesting! > > -- > *From:* Robert Yokota <rayok...@gmail.com> > *Sent:* Sunday, April 2, 2017 9:40:07 AM > *To:* user@spark.apache.org > *Subject:* Graph Analytics on HBase with HGraphDB and Spark GraphFrames > > Hi, > > In case anyone is interested in analyzing graphs in HBase with Apache > Spark GraphFrames, this might be helpful: > > https://yokota.blog/2017/04/02/graph-analytics-on-hbase-with > -hgraphdb-and-spark-graphframes/ >
Re: [SparkSQL] pre-check syntax before running spark job?
You can also run it on REPL and test to see if you are getting the expected result. Thank You, Irving Duran On Tue, Feb 21, 2017 at 8:01 AM, Yong Zhang <java8...@hotmail.com> wrote: > You can always use explain method to validate your DF or SQL, before any > action. > > > Yong > > > -- > *From:* Jacek Laskowski <ja...@japila.pl> > *Sent:* Tuesday, February 21, 2017 4:34 AM > *To:* Linyuxin > *Cc:* user > *Subject:* Re: [SparkSQL] pre-check syntex before running spark job? > > Hi, > > Never heard about such a tool before. You could use Antlr to parse SQLs > (just as Spark SQL does while parsing queries). I think it's a one-hour > project. > > Jacek > > On 21 Feb 2017 4:44 a.m., "Linyuxin" <linyu...@huawei.com> wrote: > > Hi All, > Is there any tool/api to check the sql syntax without running spark job > actually? > > Like the siddhiQL on storm here: > SiddhiManagerService. validateExecutionPlan > https://github.com/wso2/siddhi/blob/master/modules/siddhi- > core/src/main/java/org/wso2/siddhi/core/SiddhiManagerService.java > it can validate the syntax before running the sql on storm > > this is very useful for exposing sql string as a DSL of the platform. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > >
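Beyond the REPL, the general trick behind both suggestions in the thread is to run only the parser/planner (which is what explain exercises) without executing the job. As a language-agnostic illustration of that pre-check idea, here is a sketch using Python's stdlib ast parser as a stand-in, since the stdlib has no SQL parser; for Spark SQL itself, calling explain on the DataFrame plays the same role:

```python
import ast

def precheck(source):
    """Return (ok, error_message) by parsing without executing."""
    try:
        ast.parse(source)
        return True, None
    except SyntaxError as e:
        return False, str(e)

ok, _ = precheck("x = 1 + 2")
bad, msg = precheck("x = 1 +")  # malformed: nothing is ever executed
print(ok, bad, msg)
```

The key property is the same in both worlds: a syntax error surfaces immediately, before any cluster resources are spent.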
Re: Graphx Examples for ALS
Not sure I follow your question. Do you want to use ALS or GraphX? Thank You, Irving Duran On Fri, Feb 17, 2017 at 7:07 AM, balaji9058 <kssb...@gmail.com> wrote: > Hi, > > Where can i find the the ALS recommendation algorithm for large data set? > > Please feel to share your ideas/algorithms/logic to build recommendation > engine by using spark graphx > > Thanks in advance. > > Thanks, > Balaji > > > > -- > View this message in context: http://apache-spark-user-list. > 1001560.n3.nabble.com/Graphx-Examples-for-ALS-tp28401.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
Re: Is it better to use Java, Python or Scala for Spark with big data sets
I would say Java, since it will be somewhat similar to Scala. Now, this assumes that you have some app already written in Scala. If you don't, then pick the language that you feel most comfortable with. Thank you, Irving Duran On Feb 9, 2017, at 11:59 PM, nancy henry <nancyhenry6...@gmail.com> wrote: Hi All, Is it better to Use Java or Python on Scala for Spark coding.. Mainly My work is with getting file data which is in csv format and I have to do some rule checking and rule aggrgeation and put the final filtered data back to oracle so that real time apps can use it..
Re: Spark: Scala Shell Very Slow (Unresponsive)
I only experience this the first time I install a new Spark version; after that, it flows smoothly. Since you mention your server, I assume you are connecting remotely: do you experience the same latency when invoking other remote commands? If so, it might be your connection rather than Spark. Thank You, Irving Duran On Thu, Feb 2, 2017 at 3:34 PM, jimitkr <ji...@softpath.net> wrote: > Friends, > > After i launch spark-shell, the default Scala shell appears but is > unresponsive. > > When i type any command on the shell, nothing appears on my screen > shell is completely unresponsive. > > My server has 32 gigs of memory and approx 18 GB is empty after launching > spark-shell, so it may not be a memory issue. Is there some JVM size i need > to change somewhere? > > How do i get the scala shell to work as designed? > > > > -- > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Scala-Shell-Very-Slow-Unresponsive-tp28358.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
Re: Shortest path performance in Graphx with Spark
Hi Gerard, How are you starting Spark? Are you allocating enough RAM for processing? I think the default is 512 MB. Try doing the following and see if it helps (based on the size of your dataset, you might not need all 8 GB):

$SPARK_HOME/bin/spark-shell \
  --master local[4] \
  --executor-memory 8G \
  --driver-memory 8G

Thank You,

Irving Duran

On Tue, Jan 10, 2017 at 12:20 PM, Gerard Casey <gerardhughca...@gmail.com> wrote:
> Hello everyone,
>
> I am creating a graph from a `gz` compressed `json` file of `edge` and
> `vertices` type.
>
> I have put the files in a dropbox folder [here][1]
>
> I load and map these `json` records to create the `vertices` and `edge`
> types required by `graphx` like this:
>
> val vertices_raw = sqlContext.read.json("path/vertices.json.gz")
> val vertices = vertices_raw.rdd.map(row =>
>   ((row.getAs[String]("toid").stripPrefix("osgb").toLong),
>     row.getAs[Long]("index")))
> val verticesRDD: RDD[(VertexId, Long)] = vertices
> val edges_raw = sqlContext.read.json("path/edges.json.gz")
> val edgesRDD = edges_raw.rdd.map(row =>
>   Edge(row.getAs[String]("positiveNode").stripPrefix("osgb").toLong,
>     row.getAs[String]("negativeNode").stripPrefix("osgb").toLong,
>     row.getAs[Double]("length")))
> val my_graph: Graph[(Long), Double] = Graph.apply(verticesRDD,
>   edgesRDD).partitionBy(PartitionStrategy.RandomVertexCut)
>
> I then use this `dijkstra` implementation I found to compute a shortest
> path between two vertices:
>
> def dijkstra[VD](g: Graph[VD, Double], origin: VertexId) = {
>   var g2 = g.mapVertices(
>     (vid, vd) => (false, if (vid == origin) 0 else Double.MaxValue,
>       List[VertexId]()))
>   for (i <- 1L to g.vertices.count - 1) {
>     val currentVertexId: VertexId = g2.vertices
>       .filter(!_._2._1)
>       .fold((0L, (false, Double.MaxValue, List[VertexId]())))(
>         (a, b) => if (a._2._2 < b._2._2) a else b)
>       ._1
>
>     val newDistances: VertexRDD[(Double, List[VertexId])] =
>       g2.aggregateMessages[(Double, List[VertexId])](
>         ctx => if (ctx.srcId == currentVertexId) {
>           ctx.sendToDst((ctx.srcAttr._2 + ctx.attr,
>             ctx.srcAttr._3 :+ ctx.srcId))
>         },
>         (a, b) => if (a._1 < b._1) a else b)
>     g2 = g2.outerJoinVertices(newDistances)((vid, vd, newSum) => {
>       val newSumVal = newSum.getOrElse((Double.MaxValue,
>         List[VertexId]()))
>       (vd._1 || vid == currentVertexId,
>         math.min(vd._2, newSumVal._1),
>         if (vd._2 < newSumVal._1) vd._3 else newSumVal._2)
>     })
>   }
>
>   g.outerJoinVertices(g2.vertices)((vid, vd, dist) =>
>     (vd, dist.getOrElse((false, Double.MaxValue, List[VertexId]()))
>       .productIterator.toList.tail))
> }
>
> I take two random vertex id's:
>
> val v1 = 400028222916L
> val v2 = 400031019012L
>
> and compute the path between them:
>
> val results = dijkstra(my_graph, v1).vertices.map(_._2).collect
>
> I am unable to compute this locally on my laptop without getting a
> stackoverflow error. I have 8GB RAM and a 2.6 GHz Intel Core i5 processor.
> I can see that it is using 3 out of 4 cores available. I can load this
> graph and compute shortest paths at around 10 paths per second on average
> with the `igraph` library in Python on exactly the same graph. Is this an
> inefficient means of computing paths? At scale, on multiple nodes the
> paths will compute (no stackoverflow error), but it is still 30-40 seconds
> per path computation. I must be missing something.
>
> Thanks
>
> [1]: https://www.dropbox.com/sh/9ug5ikr6j357q7j/AACDBR9UdM0g_ck_ykB8KXPXa?dl=0
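[Editor's note] For a graph that fits on one laptop, the numbers quoted above (igraph at ~10 paths per second locally vs. 30-40 seconds per path on a cluster) are not surprising: a heap-based Dijkstra over an in-memory adjacency list avoids per-iteration Spark job scheduling entirely, and the local StackOverflowError is commonly caused by the long RDD lineage the `for` loop builds on `g2`. A minimal stdlib sketch of the in-memory approach, with a hypothetical toy graph:

```python
import heapq

def dijkstra(adj, origin):
    """Single-source shortest paths over an adjacency dict
    {node: [(neighbor, weight), ...]} using a binary heap."""
    dist = {origin: 0.0}
    parent = {origin: None}
    visited = set()
    heap = [(0.0, origin)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale heap entry; a shorter path was already settled
        visited.add(u)
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent

# Tiny illustrative graph (edge weights are made up):
adj = {1: [(2, 1.0), (3, 4.0)], 2: [(3, 1.0)], 3: []}
dist, parent = dijkstra(adj, 1)  # dist[3] == 2.0 via 1 -> 2 -> 3
```

The path to any vertex can then be recovered by walking `parent` back to the origin.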
Re: parsing embedded json in spark
Is it an option to parse that field prior to creating the dataframe? If so, that's what I would do. As for only the master node doing work, you will have to share more about your setup: are you running Spark standalone, YARN, or Mesos?

Thank You,

Irving Duran

On Thu, Dec 22, 2016 at 1:42 AM, Tal Grynbaum <tal.grynb...@gmail.com> wrote:
> Hi,
>
> I have a dataframe that contains an embedded JSON string in one of the
> fields.
> I tried to write a UDF that parses it using lift-json, but it seems to
> take a very long time to process, and it seems that only the master node
> is working.
>
> Has anyone dealt with such a scenario before and can give me some hints?
>
> Thanks
> Tal
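[Editor's note] Pre-parsing the embedded field before the DataFrame is built means Spark never has to run the slow UDF at all. A minimal sketch using only Python's standard library (record and field names here are invented for illustration):

```python
import json

# Rows as they might arrive, with one column holding an embedded JSON string.
records = [
    {"id": 1, "payload": '{"score": 0.9, "tags": ["a", "b"]}'},
    {"id": 2, "payload": '{"score": 0.4, "tags": []}'},
]

def flatten(rec):
    """Parse the embedded JSON string and merge its keys into the record,
    so spark.createDataFrame(...) sees plain columns instead of a blob."""
    rec = dict(rec)  # avoid mutating the caller's record
    inner = json.loads(rec.pop("payload"))
    rec.update(inner)
    return rec

flat = [flatten(r) for r in records]
```

If your Spark version is new enough to have it, `pyspark.sql.functions.from_json` with an explicit schema can do the same thing natively and in parallel on the executors.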
Re: Spark Batch checkpoint
Not sure what programming language you are using, but in Python you can do
"sc.setCheckpointDir('~/apps/spark-2.0.1-bin-hadoop2.7/checkpoint/')". This will store checkpoints in that directory (here named "checkpoint").

Thank You,

Irving Duran

On Thu, Dec 15, 2016 at 10:33 AM, Selvam Raman <sel...@gmail.com> wrote:
> Hi,
>
> Is there any provision in Spark batch for checkpointing?
>
> I have a huge amount of data; it takes more than 3 hours to process all
> of it. I currently have 100 partitions.
>
> If the job fails after two hours, say it has processed 70 partitions:
> should I start the Spark job from the beginning, or is there a way to
> checkpoint?
>
> What I am expecting from a checkpoint is to start from partition 71 and
> run to the end.
>
> Please give me your suggestions.
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
Re: [Spark log4j] Turning off log4j while scala program runs on spark-submit
Hi - I have a question about log4j while running on spark-submit. I would like to have Spark only show errors when I am running spark-submit. I would like to accomplish this without having to edit the log4j config file under $SPARK_HOME; is there a way to do this? I found this, and it only works on spark-shell (not spark-submit) -> http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console

Thanks for your help in advance.

Thank You,

Irving Duran
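[Editor's note] Two approaches avoid touching the global config under $SPARK_HOME/conf: call `sc.setLogLevel("ERROR")` early in the program itself, or ship a per-job log4j.properties with the submit command. A sketch of the second (file and job names are illustrative):

```shell
# Write a one-off log4j config that only shows errors.
cat > my-log4j.properties <<'EOF'
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
EOF

# Ship it with the job and point both driver and executors at it.
spark-submit \
  --files my-log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:my-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:my-log4j.properties" \
  my_job.py
```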