Re: How to parallel read files in a directory

2016-02-12 Thread Jörn Franke
Put many small files into Hadoop Archives (HAR) to improve the performance of reading small files. Alternatively, have a batch job concatenate them, as in the sketch below. > On 11 Feb 2016, at 18:33, Junjie Qian wrote: > > Hi all, > > I am working with Spark 1.6, scala and have a big dataset
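A minimal sketch of the concatenation route, assuming plain text files and hypothetical paths (Spark 1.x Scala API); the HAR route instead uses the hadoop archive command:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("concat-small-files"))
  // Read every small file in the directory as one RDD (paths are hypothetical).
  val events = sc.textFile("hdfs:///in/events/2016-02-11/*")
  // Rewrite into a handful of large files instead of thousands of small ones.
  events.coalesce(8).saveAsTextFile("hdfs:///in/events-compacted/2016-02-11")
  sc.stop()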

Re: Handling Hive Table With large number of rows

2016-02-07 Thread Jörn Franke
Can you provide more details? Your use case does not sound like you need Spark. Your version is in any case too old; it does not make sense to develop now with 1.2.1. There is no "project limitation" that can justify this. > On 08 Feb 2016, at 06:48, Meetu Maltiar wrote:

Re: Spark with SAS

2016-02-03 Thread Jörn Franke
This could be done through ODBC. Keep in mind that you can run SAS jobs directly on a Hadoop cluster using the SAS embedded process engine, or dump some data to a SAS LASR cluster, but you had better ask SAS about this. > On 03 Feb 2016, at 18:43, Sourav Mazumder wrote: >

Re: how to introduce spark to your colleague if he has no background about *** spark related

2016-01-31 Thread Jörn Franke
It depends of course on the background of the people, but how about some examples ("word count") of how it works in the background? > On 01 Feb 2016, at 07:31, charles li wrote: > > > Apache Spark™ is a fast and general engine for large-scale data processing. > > it's a

Re: Spark Pattern and Anti-Pattern

2016-01-26 Thread Jörn Franke
Spark has its best use cases in in-memory batch processing / machine learning. Connecting multiple different sources/destinations requires some thinking and probably more than Spark. Connecting Spark to a database makes sense in only very few cases. You will have huge performance issues due to

Re: Parquet write optimization by row group size config

2016-01-20 Thread Jörn Franke
What are your data size, the algorithm, and the expected time? Depending on this, the group can recommend optimizations or tell you that the expectations are wrong. > On 20 Jan 2016, at 18:24, Pavel Plotnikov > wrote: > > Thanks, Akhil! It helps, but this jobs

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Jörn Franke
Ignite can also cache RDDs; a sketch follows below. > On 12 Jan 2016, at 13:06, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote: > > Jorn, you said Ignite or ... ? What was the second choice you were thinking > of? It seems that got omitted. > >> On Jan 12, 2016, at 2:44 AM, Jörn Franke &
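A rough sketch of Ignite's shared-RDD support from its ignite-spark module; class and method names reflect Ignite 1.x and should be treated as assumptions to verify against the Ignite docs:

  import org.apache.ignite.configuration.IgniteConfiguration
  import org.apache.ignite.spark.IgniteContext
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("ignite-cache"))
  // IgniteContext wraps the SparkContext and connects to an Ignite cluster.
  val ic = new IgniteContext[String, Double](sc, () => new IgniteConfiguration())
  // fromCache returns an IgniteRDD backed by an Ignite cache, i.e. state that
  // survives across separate Spark jobs (hypothetical cache/file names).
  val shared = ic.fromCache("lookupData")
  shared.savePairs(sc.textFile("hdfs:///models/lookup.csv").map { line =>
    val Array(k, v) = line.split(",")
    (k, v.toDouble)
  })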

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Jörn Franke
You can look at Ignite as an HDFS cache or for storing RDDs. > On 11 Jan 2016, at 21:14, Dmitry Goldenberg wrote: > > We have a bunch of Spark jobs deployed and a few large resource files such as > e.g. a dictionary for lookups or a statistical model. > > Right now,

Re: Update Hive tables from Spark without loading entire table in to a dataframe

2016-01-06 Thread Jörn Franke
You can mark the table as transactional, and then you can do single updates; see the sketch below. > On 07 Jan 2016, at 08:10, sudhir wrote: > > Hi, > > I have a hive table of 20Lakh records and to update a row I have to load the > entire table in dataframe and process that and then Save it
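A sketch of what this looks like, assuming Hive 0.14+ with ACID enabled and a made-up table; note that the UPDATE itself is a Hive statement - Spark SQL of that era could not execute ACID DML, so it has to run in Hive (e.g. via beeline):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)  // sc: existing SparkContext
  // Hive ACID requires ORC, bucketing, and the transactional property.
  hiveContext.sql(
    """CREATE TABLE records (id INT, payload STRING)
      |CLUSTERED BY (id) INTO 16 BUCKETS
      |STORED AS ORC
      |TBLPROPERTIES ('transactional'='true')""".stripMargin)
  // Then, from the Hive CLI/beeline, a single-row update without rewriting
  // the whole table:
  //   UPDATE records SET payload = 'new value' WHERE id = 42;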

Re: Need Help in Spark Hive Data Processing

2016-01-06 Thread Jörn Franke
You need the table in an efficient format, such as ORC or Parquet. Have the table sorted appropriately (hint: most discriminating column in the WHERE clause). Do not use SAN or virtualization for the slave nodes. Can you please post your query? I always recommend avoiding single updates where

Re: How to run multiple Spark jobs as a workflow that takes input from a Streaming job in Oozie

2015-12-20 Thread Jörn Franke
Flume could be interesting for you. > On 19 Dec 2015, at 00:27, SRK wrote: > > Hi, > > How to run multiple Spark jobs that takes Spark Streaming data as the > input as a workflow in Oozie? We have to run our Streaming job first and > then have a workflow of Spark

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Jörn Franke
Can you elaborate a little bit more on the use case? It looks a little bit like an abuse of Spark in general. Interactive queries that are not suitable for in-memory batch processing might be better supported by Ignite, which has in-memory indexes, a concept of hot, warm, cold data etc., or Hive on

Re: Spark with MapDB

2015-12-08 Thread Jörn Franke
You may want to use a bloom filter for this, but make sure that you understand how it works; a sketch follows below. > On 08 Dec 2015, at 09:44, Ramkumar V wrote: > > Im running spark batch job in cluster mode every hour and it runs for 15 > minutes. I have certain unique keys in the dataset.
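A minimal sketch with Guava's BloomFilter (names and sizes are made up); the thing to understand is that it can return false positives but never false negatives:

  import com.google.common.base.Charsets
  import com.google.common.hash.{BloomFilter, Funnels}

  // Filter sized for ~10M keys at a 1% false-positive rate.
  val seenKeys = BloomFilter.create(
    Funnels.stringFunnel(Charsets.UTF_8), 10000000, 0.01)
  Seq("key-1", "key-2").foreach(seenKeys.put)  // keys from earlier runs

  // "false" is authoritative (definitely new); "true" may be a false
  // positive, so act on it only if occasional misses are acceptable.
  val isNew = !seenKeys.mightContain("key-123")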

Re: Logging spark output to hdfs file

2015-12-08 Thread Jörn Franke
This would require a special HDFS log4j appender. Alternatively, try the Flume log4j appender. > On 08 Dec 2015, at 13:00, sunil m <260885smanik...@gmail.com> wrote: > > Hi! > I configured log4j.properties file in conf folder of spark with following > values... > >

Re: Graph visualization tool for GraphX

2015-12-08 Thread Jörn Franke
I am not sure about your use case. How should a human interpret many terabytes of data in one large visualization? You have to be more specific: what part of the data needs to be visualized, what kind of visualization, what navigation do you expect within the visualization, how many users,

Re: Low Latency SQL query

2015-12-01 Thread Jörn Franke
Can you elaborate more on the use case? > On 01 Dec 2015, at 20:51, Andrés Ivaldi wrote: > > Hi, > > I'd like to use spark to perform some transformations over data stored in SQL, > but I need low latency. I'm doing some tests and I run into spark context > creation and

Re: Low Latency SQL query

2015-12-01 Thread Jörn Franke
> as Measure and Product, Product Family as Dimension > > Only 3 columns, it takes like 20s to perform that query and the aggregation; > the query directly to the database with a grouping on the columns takes like > 1s > > regards > > > >> On Tue, Dec 1, 2

Re: Experiences about NoSQL databases with Spark

2015-11-28 Thread Jörn Franke
I would not use MongoDB because it does not fit well into the Spark or Hadoop architecture. You can use it if your data volume is very small and already preaggregated, but this is a very limited use case. You can use HBase, or future versions of Hive (if they use Tez > 0.8). For interactive

Re: Hive using Spark engine alone

2015-11-27 Thread Jörn Franke
Hi, I recommend using the latest version of Hive. You may also wait for Hive on Tez with Tez >= 0.8 and Hive > 1.2. Before that, I recommend first trying other Hive optimizations and having a look at the storage format together with storage indexes (not the regular ones), bloom

Re: [Yarn] Executor cores isolation

2015-11-10 Thread Jörn Franke
I would have to check the Spark source code, but theoretically you can limit the number of threads at the JVM level. Maybe Spark does this. Alternatively, you can use cgroups, but this introduces other complexity. > On 10 Nov 2015, at 14:33, Peter Rudenko wrote: > > Hi i

Re: though experiment: Can I use spark streaming to replace all of my rest services?

2015-11-10 Thread Jörn Franke
Maybe you are looking for WebSockets/STOMP to get it to the end user? Or HTTP/2 and STOMP in the future. > On 10 Nov 2015, at 21:28, Andy Davidson wrote: > > I just finished watching a great presentation from a recent spark summit on > real time movie recommendations using

Re: OLAP query using spark dataframe with cassandra

2015-11-08 Thread Jörn Franke
Is there any distributor supporting these software components in combination? If not, and your core business is not software, then you may want to look for something else, because it might not make sense to build up internal know-how in all of these areas. In any case - it all depends highly on

Re: very slow parquet file write

2015-11-06 Thread Jörn Franke
Do you use some compression? Maybe some is activated by default in your Hadoop environment? > On 06 Nov 2015, at 00:34, rok wrote: > > Apologies if this appears a second time! > > I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a >

Re: ipython notebook NameError: name 'sc' is not defined

2015-11-02 Thread Jörn Franke
You can check a script that I created for the Amazon cloud: https://snippetessay.wordpress.com/2015/04/18/big-data-lab-in-the-cloud-with-hadoopsparkrpython/ If I remember correctly, you need to add something to the startup .py for IPython. > On 03 Nov 2015, at 01:04, Andy Davidson

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jörn Franke
Try with max(date); in your case it could make more sense to represent the date as an int. > On 01 Nov 2015, at 21:03, Koert Kuipers wrote: > > hello all, > i am trying to get familiar with spark sql partitioning support. > > my data is partitioned by date,
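A sketch of the max-date approach on date-partitioned Parquet data (Spark 1.x DataFrame API; paths and the int-encoded date column are hypothetical):

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.functions.max

  val sqlContext = new SQLContext(sc)  // sc: existing SparkContext
  // Partition discovery turns .../date=20151101/ directories into a "date" column.
  val df = sqlContext.read.parquet("hdfs:///data/events")
  // Storing the date as an int (e.g. 20151101) keeps max() and comparisons cheap.
  val lastDate = df.agg(max("date")).first().getInt(0)
  val latest = df.filter(df("date") === lastDate)  // pruned to the last partition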

Re: Spark tunning increase number of active tasks

2015-10-31 Thread Jörn Franke
Maybe Hortonworks support can help you much better. Otherwise, you may want to change the YARN scheduler configuration and preemption. Do you use something like speculative execution? How do you start execution of the programs? Maybe you are already using all cores of the master... > On 30 Oct

Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Jörn Franke
What storage format? > On 30 Oct 2015, at 12:05, Rex Xiong wrote: > > Hi folks, > > I have a Hive external table with partitions. > Every day, an App will generate a new partition day=yyyy-MM-dd stored by > parquet and run add-partition Hive command. > In some cases, we

Re: Spark with business rules

2015-10-26 Thread Jörn Franke
Maybe SparkR? What languages do your users speak? > On 26 Oct 2015, at 23:12, danilo wrote: > > Hi All, I want to create a monitoring tool using my sensor data. I receive > the events every second and I need to create a report using node.js. Right > now I created my kpi

Re: Should I convert json into parquet?

2015-10-18 Thread Jörn Franke
Good formats are Parquet or ORC. Both can be used with compression, such as Snappy. They are much faster than JSON; a conversion sketch follows below. However, the table structure is up to you and depends on your use case. > On 17 Oct 2015, at 23:07, Gavin Yue wrote: > > I have json files which
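A minimal conversion sketch (Spark 1.x; paths hypothetical):

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)  // sc: existing SparkContext
  // Snappy-compressed Parquet output.
  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
  // Infer the schema from the JSON files and rewrite them as Parquet.
  sqlContext.read.json("hdfs:///raw/events/*.json")
    .write.parquet("hdfs:///warehouse/events_parquet")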

Re: Datastore or DB for spark

2015-10-09 Thread Jörn Franke
I am not aware of any empirical evidence, but I think Hadoop (HDFS) as a datastore for Spark is quite common. With relational databases you usually do not have so much data, and you do not benefit from data locality. On Fri, 9 Oct 2015 at 15:16, Rahul Jeevanandam wrote: >

Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
Thanks a lot! > Nicolas > > > - Original message - > From: "Jörn Franke" <jornfra...@gmail.com> > To: nib...@free.fr, "Brett Antonides" <banto...@gmail.com> > Cc: user@spark.apache.org > Sent: Saturday 3 October 2015 11:17:51 > Subject: Re: H

Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
suppose the records are still updatable. > > Thanks for confirming whether it can be a solution for my use case. Or any other idea... > > Thanks a lot! > Nicolas > > > - Original message - > From: "Jörn Franke" <jornfra...@gmail.com> > To: nib...@free.fr, "Brett Anto

Re: RE : Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
> > Nicolas > > > > > Sent from my Samsung mobile device > > Jörn Franke <jornfra...@gmail.com> wrote: > > If you use transactional tables in Hive together with insert, update, > delete, then it does the "concatenate" for you a

Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
> > On Fri, Oct 2, 2015 at 3:48 PM, < nib...@free.fr > wrote: > > > Hello, > Yes but: > - In the Java API I don't find an API to create an HDFS archive > - As soon as I receive a message (with messageID) I need to replace the > old existing file by the new one (na

Re: RE : Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
. 2015 at 16:48, <nib...@free.fr> wrote: > Thanks a lot. Why did you say "the most recent version"? > > - Original message - > From: "Jörn Franke" <jornfra...@gmail.com> > To: "nibiau" <nib...@free.fr> > Cc: banto...@gmail.com, user

Re: HDFS small file generation problem

2015-09-28 Thread Jörn Franke
Use a Hadoop archive. On Sun, 27 Sep 2015 at 15:36, wrote: > Hello, > I'm still investigating my small file generation problem generated by my > Spark Streaming jobs. > Indeed, my Spark Streaming jobs are receiving a lot of small events (avg > 10kb), and I have to store them

Re: Spark Ingestion into Relational DB

2015-09-22 Thread Jörn Franke
Once the data is consolidated in Oracle, it serves as the source > of truth for external users. > > Regards, > Sri Eswari. > > On Mon, Sep 21, 2015 at 10:55 PM, Jörn Franke <jornfra...@gmail.com> > wrote: > >> You do not need Hadoop. However, you should think abou

Re: Spark Ingestion into Relational DB

2015-09-21 Thread Jörn Franke
You do not need Hadoop. However, you should think about using it. If you use Spark to load data directly from Oracle, then your database might see unexpected load once a Spark node fails. Additionally, the Oracle database, if it is not based on local disk, may have a storage

Re: Using Spark for portfolio manager app

2015-09-20 Thread Jörn Franke
l storage lead to high latency in my app. > > 3/ How to get real-time statistics from Spark. > In most of the Spark streaming examples, the statistics are echoed to > stdout. > However, I want to display those statistics on a GUI; is there any way to > retrieve data from Spark directl

Re: Using Spark for portfolio manager app

2015-09-19 Thread Jörn Franke
If you want to be able to let your users query their portfolio, then you may want to think about storing the current state of the portfolios in HBase/Phoenix; alternatively, a cluster of relational databases can make sense. For the rest you may use Spark. On Sat, 19 Sep 2015 at 4:43, Thúy Hằng Lê

Re: Spark Streaming Suggestion

2015-09-14 Thread Jörn Franke
Why did you not stay with the batch approach? To me the architecture looks very complex for the simple thing you want to achieve. Why don't you process the data already in Storm? On Tue, 15 Sep 2015 at 6:20, srungarapu vamsi wrote: > I am pretty new to spark.

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Jörn Franke
I fear you have to do the plumbing all yourself. This is the same for all commercial and non-commercial libraries/analytics packages. It often also depends on the functional requirements and on how you distribute. On Sat, 12 Sep 2015 at 20:18, Rex X wrote: > Hi everyone, > >

Re: SIGTERM 15 Issue : Spark Streaming for ingesting huge text files using custom Receiver

2015-09-12 Thread Jörn Franke
I am not sure what you are trying to achieve here. Have you thought about using Flume? Additionally, maybe something like rsync? On Sat, 12 Sep 2015 at 0:02, Varadhan, Jawahar wrote: > Hi all, >I have coded a custom receiver which receives kafka messages.

Re: Best way to import data from Oracle to Spark?

2015-09-08 Thread Jörn Franke
What do you mean by import? All ways have advantages and disadvantages. You may first think about when you can make large extractions of data from the database into Spark; a JDBC sketch follows below. You may also think about whether the database should be the persistent storage of the data or if you need something aside of the
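For the large-extraction route, a sketch of Spark's JDBC data source (Spark 1.x; connection details hypothetical, Oracle driver assumed on the classpath) - the partitioning options control how the extraction is parallelized:

  val accounts = sqlContext.read.format("jdbc").options(Map(
    "url"             -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
    "dbtable"         -> "ACCOUNTS",
    "partitionColumn" -> "ACCOUNT_ID",  // numeric column to split on
    "lowerBound"      -> "1",
    "upperBound"      -> "10000000",
    "numPartitions"   -> "16"           // 16 parallel extraction queries
  )).load()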

Re: Spark SQL - UDF for scoring a model - take $"*"

2015-09-07 Thread Jörn Franke
Can you use a map or list with different properties as one parameter? Alternatively, a string where parameters are comma-separated... A sketch follows below. On Mon, 7 Sep 2015 at 8:35, Night Wolf wrote: > Is it possible to have a UDF which takes a variable number of arguments? > > e.g.
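A sketch of the single-collection workaround: register the UDF over one array column and assemble that column at call time (Spark 1.x; df and its columns are hypothetical):

  import org.apache.spark.sql.functions.{array, col, udf}

  // One Seq parameter instead of a variable number of arguments.
  val score = udf((features: Seq[Double]) => features.sum / features.size)

  // Collapse however many columns you have into a single array column.
  val scored = df.withColumn("score", score(array(col("f1"), col("f2"), col("f3"))))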

Re: SparkR / MLlib Integration

2015-09-04 Thread Jörn Franke
You can always use the ML libraries in R, but you have to integrate them in SparkR (i.e., make all the logic run in parallel, etc.). However, for your use case it may make more sense to write the R wrapper for MLlib yourself, if the project cannot provide it in time. It is not that difficult to call Java or

Re: Small File to HDFS

2015-09-04 Thread Jörn Franke
Maybe you can tell us more about your use case; I somehow have the feeling that we are missing something here. On Thu, 3 Sep 2015 at 15:54, Jörn Franke <jornfra...@gmail.com> wrote: > > Store them as a Hadoop archive (HAR) > > On Wed, 2 Sep 2015 at 18:07, <nib...@free.fr&

Re: Small File to HDFS

2015-09-03 Thread Jörn Franke
Store them as a Hadoop archive (HAR). On Wed, 2 Sep 2015 at 18:07, wrote: > Hello, > I'm currently using Spark Streaming to collect small messages (events), > size being <50 KB, volume is high (several millions per day), and I have to > store those messages in HDFS. > I

Re: Slow Mongo Read from Spark

2015-09-03 Thread Jörn Franke
You might think about another storage layer than MongoDB (HDFS+ORC+compression or HDFS+Parquet+compression) to improve performance. On Thu, 3 Sep 2015 at 9:15, Akhil Das wrote: > On SSD you will get around 30-40MB/s on a single machine (on 4 cores). > >

Re: Hbase Lookup

2015-09-03 Thread Jörn Franke
guha <guha.a...@gmail.com> wrote: > >> Thanks for your info. I am planning to implement a pig udf to do record >> look ups. Kindly let me know if this is a good idea. >> >> Best >> Ayan >> >> On Thu, Sep 3, 2015 at 2:55 PM, Jörn Franke <jo

Re: Small File to HDFS

2015-09-03 Thread Jörn Franke
o use Pig on it > and what about performance? > > - Original message - > From: "Jörn Franke" <jornfra...@gmail.com> > To: nib...@free.fr, user@spark.apache.org > Sent: Thursday 3 September 2015 15:54:42 > Subject: Re: Small File to HDFS > > > > > Store them as h

Re: Small File to HDFS

2015-09-03 Thread Jörn Franke
(remove/replace) a file inside the HAR? > Basically the names of my small files will be the keys of my records, and > sometimes I will need to replace the content of a file with new content > (remove/replace) > > > Thanks a lot > Nicolas > > - Original message - >

Re: Ranger-like Security on Spark

2015-09-03 Thread Jörn Franke
Well, if it needs to read from HDFS, then it will adhere to the permissions defined there and/or in Ranger. However, I am not aware that you can protect DataFrames, tables, or streams in general in Spark. On Thu, 3 Sep 2015 at 21:47, Daniel Schulz wrote: > Hi

Re: Hbase Lookup

2015-09-02 Thread Jörn Franke
You may check whether it makes sense to write a coprocessor doing an upsert for you, if it does not exist already. Maybe Phoenix for HBase supports this already. Another alternative, if the records do not have a unique ID, is to put them into a text index engine, such as Solr or Elasticsearch, which

Re: How mature is spark sql

2015-09-01 Thread Jörn Franke
Depends on what you need to do. Can you tell us more about your use cases? On Tue, 1 Sep 2015 at 13:07, rakesh sharma wrote: > Is it mature enough to use it extensively? I see that it is easier to do > than writing map/reduce in java. > We are being asked to do it

Re: Feasibility Project - Text Processing and Category Classification

2015-08-28 Thread Jörn Franke
I think there is already an example for this shipped with Spark. However, you do not really benefit from any Spark functionality for this scenario. If you want to do something more advanced, you should look at Elasticsearch or Solr. On Fri, 28 Aug 2015 at 16:15, Darksu nick_tou...@hotmail.com

Re: Efficient sampling from a Hive table

2015-08-26 Thread Jörn Franke
Have you tried TABLESAMPLE? You will find the exact syntax in the documentation, but it does exactly what you want; a sketch follows below. On Wed, 26 Aug 2015 at 18:12, Thomas Dudziak tom...@gmail.com wrote: Sorry, I meant without reading from all splits. This is a single partition in the table. On Wed, Aug 26,
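A sketch of the Hive bucket-sampling syntax (table name hypothetical); it certainly works in Hive itself, while support in Spark's HiveQL parser depends on the version:

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)  // sc: existing SparkContext
  // Read roughly 1/100th of the table instead of all splits.
  val sample = hiveContext.sql(
    "SELECT * FROM transactions TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) s")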

Re: JDBC Streams

2015-08-26 Thread Jörn Franke
I would use Sqoop. It has been designed exactly for these types of scenarios. Spark Streaming does not make sense here. On Sun, 5 Jul 2015 at 1:59, ayan guha guha.a...@gmail.com wrote: Hi All I have a requirement to connect to a DB every few minutes and bring data to HBase. Can anyone

Re: Evaluating spark + Cassandra for our use cases

2015-08-18 Thread Jörn Franke
Hi, First you need to make your SLA clear. It does not sound to me like they are defined very well, or that your solution is necessary for the scenario. I also find it hard to believe that one customer has 100 million transactions per month. Time-series data is easy to precalculate - you do not

Re: Spark job workflow engine recommendations

2015-08-07 Thread Jörn Franke
Also check Falcon in combination with Oozie. On Fri, 7 Aug 2015 at 17:51, Hien Luu h...@linkedin.com.invalid wrote: Looks like Oozie can satisfy most of your requirements. On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone vikramk...@gmail.com wrote: Hi, I'm looking for open source workflow

Re: Is it worth storing in ORC for one time read. And can be replace hive with HBase

2015-08-06 Thread Jörn Franke
Yes, you should use ORC; it is much faster and more compact. Additionally, you can apply compression (Snappy) to increase performance. Your data processing pipeline seems to be not very optimized. You should use the newest Hive version, enabling storage indexes and bloom filters on appropriate

Re: Is it worth storing in ORC for one time read. And can be replace hive with HBase

2015-08-06 Thread Jörn Franke
in join statements (not WHERE), otherwise you do a full table scan ignoring partitions. On Thu, 6 Aug 2015 at 15:07, Jörn Franke jornfra...@gmail.com wrote: Yes, you should use ORC; it is much faster and more compact. Additionally, you can apply compression (Snappy) to increase performance. Your

Re: Transform MongoDB Aggregation into Spark Job

2015-08-04 Thread Jörn Franke
Hi, I think the combination of MongoDB and Spark is a little bit unlucky. Why don't you simply use MongoDB? If you want to process a lot of data, you should use HDFS or Cassandra as storage. MongoDB is not suitable for heterogeneous processing of large-scale data. Best regards,

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Jörn Franke
I think your use case can already be implemented with HDFS encryption and/or SealedObject, if you are looking for something like Altibase. If you create a JIRA, you may want to set the bar a little bit higher and propose something like MIT's CryptDB: https://css.csail.mit.edu/cryptdb/ On Fri, 31 Jul 2015 at 10:17,

Re: HiveQL to SparkSQL

2015-07-29 Thread Jörn Franke
What Hive version are you using? Do you run it on Tez? Are you using the ORC format? Do you use compression? Snappy? Do you use bloom filters? Do you insert the data sorted on the right columns? Do you use partitioning? Did you increase the replication factor for often-used tables or

Re: Is SPARK is the right choice for traditional OLAP query processing?

2015-07-28 Thread Jörn Franke
You may check out Apache Phoenix on top of HBase for this. However, it does not have ODBC drivers, only JDBC ones. Maybe Hive 1.2 with a new version of Tez will also serve your purpose. You should run a proof of concept with these technologies using real or generated data. About how much data

Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jörn Franke
Can you put some transparent cache in front of the database? Or some JDBC proxy? On Tue, 28 Jul 2015 at 19:34, Jeetendra Gangele gangele...@gmail.com wrote: Can the source write to Kafka/Flume/HBase in addition to Postgres? No, it can't write; this is due to the fact that there are many

Re: Download Apache Spark on Windows 7 for a Proof of Concept installation

2015-07-26 Thread Jörn Franke
Use a Hadoop distribution that supports Windows and has Spark included. Generally - if you want to use Windows - you should use the server version. On Sat, 25 Jul 2015 at 20:11, Peter Leventis pleven...@telkomsa.net wrote: I just wanted an easy step-by-step guide as to exactly what version

Re: Twitter4J streaming question

2015-07-23 Thread Jörn Franke
He should still see something. I think you need to subscribe to the screen name first and not filter it out only in the filter method. I do not have the APIs at hand on mobile, but there should be a method. On Thu, 23 Jul 2015 at 22:30, Enno Shioji eshi...@gmail.com wrote: You need to pay

Re: Need help in SparkSQL

2015-07-22 Thread Jörn Franke
Can you provide an example of an AND query? If you just do lookups, you should try HBase/Phoenix; otherwise you can try ORC with storage indexes and/or compression, but this depends on what your queries look like. On Wed, 22 Jul 2015 at 14:48, Jeetendra Gangele gangele...@gmail.com wrote: Hi

Re: Need help in SparkSQL

2015-07-22 Thread Jörn Franke
choice? I don't want to iterate over the result set which HBase returns and give the result, because this will kill the performance. On 23 July 2015 at 01:02, Jörn Franke jornfra...@gmail.com wrote: Can you provide an example of an AND query? If you just do lookups, you should try HBase/Phoenix

Re: How to restart Twitter spark stream

2015-07-19 Thread Jörn Franke
Why do you even want to stop it? You can join it with an RDD loading the newest hashtags from disk at a regular interval; see the sketch below. On Sun, 19 Jul 2015 at 7:40, Zoran Jeremic zoran.jere...@gmail.com wrote: Hi, I have a twitter spark stream initialized in the following way: val
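A sketch of the reload-and-join pattern (Spark Streaming, Scala; source, path, and interval hypothetical):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(10))  // sc: existing SparkContext

  // Re-assigned by a background thread/timer whenever the list on disk changes.
  @volatile var hashTags = sc.textFile("hdfs:///conf/hashtags").map((_, ())).cache()

  val tweets = ssc.socketTextStream("localhost", 9999)  // stand-in source
  // transform re-evaluates its closure every batch, so a swapped-in RDD
  // reference is picked up without stopping the stream.
  val matched = tweets.map(t => (t, t)).transform(rdd => rdd.join(hashTags).keys)

  ssc.start()
  ssc.awaitTermination()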

Re: Research ideas using spark

2015-07-15 Thread Jörn Franke
Well, one of the strengths of Spark is standardized general distributed processing, allowing many different types of processing, such as graph processing, stream processing, etc. The limitation is that it is less performant than one system focusing on only one type of processing (e.g. graph processing).

Re: Spark performance

2015-07-11 Thread Jörn Franke
What is your business case for the move? On Fri, 10 Jul 2015 at 12:49, Ravisankar Mani rrav...@gmail.com wrote: Hi everyone, I have planned to move from MSSQL Server to Spark. I am using around 50,000 to 1 lakh records. The Spark performance is slow when compared to MSSQL Server. What is

Re: Spark performance

2015-07-11 Thread Jörn Franke
Honestly, you are addressing this wrongly - you do not seem to have a business case for changing - so why do you want to switch? On Sat, 11 Jul 2015 at 3:28, Mohammed Guller moham...@glassbeam.com wrote: Hi Ravi, First, neither Spark nor Spark SQL is a database. Both are compute engines,

Re: Spark performance

2015-07-11 Thread Jörn Franke
On Sat, 11 Jul 2015 at 14:53, Roman Sokolov ole...@gmail.com wrote: Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation. On 11.07.2015 03:28, Mohammed Guller moham...@glassbeam.com wrote: Hi Ravi, First, neither Spark nor

Re: .NET on Apache Spark?

2015-07-05 Thread Jörn Franke
IronPython shares only the syntax with Python - at best. It is a scripting language within the .NET framework; many applications have it for scripting the application itself. This won't work for you. You can use pipes, or write your Spark jobs in Java/Scala/R and submit them via your .NET

Re: Spark custom streaming receiver not storing data reliably?

2015-07-05 Thread Jörn Franke
Can you provide the result set you are using and specify how you integrated the Drools engine? Drools is basically based on a large shared memory. Hence, if you have several tasks in Spark, they end up using different shared-memory areas. A full integration of Drools requires some sophisticated

Re: Help optimising Spark SQL query

2015-06-22 Thread Jörn Franke
Generally (not only Spark SQL specific), you should not cast in the WHERE part of a SQL query; it is also not necessary in your case. Getting rid of casts in the whole query will also be beneficial; see the sketch below. On Mon, 22 Jun 2015 at 17:29, James Aley james.a...@swiftkey.com wrote: Hello, A colleague
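The reason: a cast applied to the filtered column must run on every row and can defeat predicate pushdown and partition pruning. A before/after sketch (hypothetical schema):

  // Slow: the cast wraps the column, so the predicate is computed per row.
  sqlContext.sql(
    "SELECT id FROM events WHERE CAST(event_ts AS DATE) = '2015-06-01'")

  // Better: leave the column untouched and compare against range literals.
  sqlContext.sql(
    "SELECT id FROM events WHERE event_ts >= '2015-06-01 00:00:00' " +
    "AND event_ts < '2015-06-02 00:00:00'")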

Re: Spark not working on windows 7 64 bit

2015-06-10 Thread Jörn Franke
You may compare the c:\windows\system32\drivers\etc\hosts files to check whether they are configured similarly. On Wed, 10 Jun 2015 at 17:16, Eran Medan eran.me...@gmail.com wrote: I'm at a roadblock trying to understand why Spark doesn't work for a colleague of mine on his Windows 7 laptop. I have pretty

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Jörn Franke
I am not sure they work with HDFS paths. You may want to look at the source code. Alternatively, you can create a fat jar containing all jars (let your build tool set up META-INF correctly); this always works. A sketch follows below. On Wed, 10 Jun 2015 at 6:22, Dong Lei dong...@microsoft.com wrote: Thanks so much!
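A minimal fat-jar sketch with the sbt-assembly plugin (version and merge rules are assumptions to check against the plugin docs; META-INF clashes are the usual failure mode):

  // project/plugins.sbt
  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

  // build.sbt: discard conflicting META-INF entries when merging jars
  assemblyMergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case _                             => MergeStrategy.first
  }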

Re: how to make a spark cluster ?

2015-04-19 Thread Jörn Franke
Hi, if you have just one physical machine, then I would try out Docker instead of a full VM (which would be a waste of memory and CPU). Best regards. On 20 Apr 2015 00:11, hnahak harihar1...@gmail.com wrote: Hi All, I've a big physical machine with 16 CPUs, 256 GB RAM, and a 20 TB hard disk. I just need

Re: Pseudo Spark Streaming ?

2015-04-05 Thread Jörn Franke
Hello, just because you receive the log files hourly does not mean that you have to use Spark Streaming. Spark Streaming is often used if you receive new events each minute/second, potentially at an irregular frequency. Of course your analysis window can be larger. I think your use case justifies

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Jörn Franke
Hello, well, all problems you want to solve with technology need a good justification for a certain technology. So the first thing is to ask which technology fits your current and future problems. This is also what the article says. Unfortunately, it only provides a vague answer

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Jörn Franke
You can also preaggregate results for the queries by the user - depending on what queries they use, this might be necessary for any underlying technology. On 26 Mar 2015 11:27, kundan kumar iitr.kun...@gmail.com wrote: Hi, I need to store terabytes of data which will be used for BI tools

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Jörn Franke
, kundan kumar iitr.kun...@gmail.com wrote: I am looking for some options and came across http://www.jethrodata.com/ On Thu, Mar 26, 2015 at 5:47 PM, Jörn Franke jornfra...@gmail.com wrote: You can also preaggregate results for the queries by the user - depending on what queries they use

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Jörn Franke
You probably need to add the DLL directory to the PATH (not classpath!) environment variable on all nodes. On 26 Mar 2015 06:23, Xi Shen davidshe...@gmail.com wrote: Of course not... all machines in HDInsight are Windows 64-bit servers. And I have made sure all my DLLs are for 64-bit machines.

Re: HIVE SparkSQL

2015-03-18 Thread Jörn Franke
Hello, depending on your needs, search technology such as SolrCloud or Elasticsearch may make more sense. If you go for the Cassandra solution, you can use the Lucene text indexer... I am not sure whether Hive or Spark SQL is very suitable for text. However, if you do not need text search, then feel free

Re: Scalable JDBCRDD

2015-03-01 Thread Jörn Franke
What database are you using? On 28 Feb 2015 18:15, Michal Klos michal.klo...@gmail.com wrote: Hi Spark community, We have a use case where we need to pull huge amounts of data from a SQL query against a database into Spark. We need to execute the query against our huge database and not
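For pulling one huge parameterized query into Spark, a sketch of Spark's own JdbcRDD, which splits the query into ranged partitions (query, bounds, and connection string hypothetical):

  import java.sql.DriverManager
  import org.apache.spark.rdd.JdbcRDD

  val rows = new JdbcRDD(
    sc,  // existing SparkContext
    () => DriverManager.getConnection("jdbc:postgresql://dbhost/db"),
    "SELECT id, amount FROM facts WHERE id >= ? AND id <= ?",
    1L, 100000000L,  // bounds of the split column
    20,              // number of partitions = concurrent ranged queries
    rs => (rs.getLong("id"), rs.getDouble("amount")))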

Re: Spark Streaming and message ordering

2015-02-20 Thread Jörn Franke
You may also consider whether your use case really needs a very strict order, because configuring Spark to support such a strict order means rendering most of its benefits useless (failure handling, parallelism, etc.). Usually, in a distributed setting you can order events, but this also means that

Re: Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Jörn Franke
Hi, what do your jobs do? Ideally post the source code, but some description would already be helpful to support you. Memory leaks can have several causes - it may not be Spark at all. Thank you. On 26 Jan 2015 22:28, Gerard Maas gerard.m...@gmail.com wrote: (looks like the list didn't like

Re: Eclipse on spark

2015-01-25 Thread Jörn Franke
I recommend using a build tool within Eclipse, such as Gradle or Maven. On 24 Jan 2015 19:34, riginos samarasrigi...@gmail.com wrote: How do I compile a Spark project in Scala IDE for Eclipse? I've got many Scala scripts and I no longer want to load them from the Scala shell; what can I do? --

Re: processing large dataset

2015-01-22 Thread Jörn Franke
Did you try it with a smaller subset of the data first? A sketch follows below. On 23 Jan 2015 05:54, Kane Kim kane.ist...@gmail.com wrote: I'm trying to process 5TB of data, not doing anything fancy, just map/filter and reduceByKey. Spent the whole day today trying to get it processed, but never succeeded. I've
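A sketch of the subset run (fraction hypothetical) - debug the pipeline on a small sample before paying for the full 5 TB pass:

  val raw = sc.textFile("hdfs:///data/input")  // sc: existing SparkContext
  // ~0.1% sample, deterministic via the fixed seed.
  val subset = raw.sample(withReplacement = false, fraction = 0.001, seed = 42L)
  val counts = subset.map(line => (line.split("\t")(0), 1L)).reduceByKey(_ + _)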

Re: spark streaming with checkpoint

2015-01-22 Thread Jörn Franke
Maybe you are using the wrong approach - try something like HyperLogLog or bitmap structures, as you can find them, for instance, in Redis; they are much smaller. A sketch follows below. On 22 Jan 2015 17:19, Balakrishnan Narendran balu.na...@gmail.com wrote: Thank you Jerry, Does the window operation create new
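Spark itself ships an approximate, HyperLogLog-based alternative to holding every key (relative accuracy value hypothetical):

  // Constant, small memory instead of a giant set of keys kept across batches;
  // the argument is the acceptable relative standard deviation of the estimate.
  val approxUniques = sc.textFile("hdfs:///logs/keys").countApproxDistinct(0.01)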

Re: Join DStream With Other Datasets

2015-01-17 Thread Jörn Franke
Can't you send a special event through Spark Streaming once the list is updated? So you have your normal events and a special reload event. On 17 Jan 2015 15:06, Ji ZHANG zhangj...@gmail.com wrote: Hi, I want to join a DStream with some other dataset, e.g. join a click stream with a spam

Re: How to integrate Spark with OpenCV?

2015-01-14 Thread Jörn Franke
Basically, you have to think about how to split the data (for pictures this can be, for instance, 8x8 matrices) and use Spark to distribute it to different workers, which themselves call OpenCV with the data. Afterwards you need to combine all the results again. It really depends on your image/video

Re: Did anyone tried overcommit of CPU cores?

2015-01-08 Thread Jörn Franke
Hello, based on experience with other software in virtualized environments, I cannot really recommend this. However, I am not sure how Spark reacts. You may face unpredictable task failures depending on utilization; tasks connecting to external systems (databases etc.) may fail unexpectedly and

Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-04 Thread Jörn Franke
Hello, it really depends on your requirements: what kind of machine-learning algorithm, your budget, whether you are currently doing something really new or integrating it with an existing application, etc. You can run MongoDB as a cluster as well. I don't think this question can be answered generally, but

Re: Loading a large Hbase table into SPARK RDD takes quite long time

2014-12-04 Thread Jörn Franke
Hi, what is your cluster setup? How much memory do you have? How much space does one row consisting of only the 3 columns consume? Do you run other stuff in the background? Best regards. On 04.12.2014 23:57, bonnahu bonn...@gmail.com wrote: I am trying to load a large Hbase table into SPARK

RE: Problem executing Spark via JBoss application

2014-10-16 Thread Jörn Franke
Do you create the application in the context of the web service call? Then the application may be killed after you return from the web service call. However, we would need to see what you do during the web service call and how you invoke the Spark application. On 16 Oct 2014 08:50, Mehdi Singer
